Media Annotation in the Video Production Workflow
ABSTRACT
Content provider companies and television networks have large video
archives but usually do not take full advantage of them. In order
to assist the exploration of video archives, we developed
ViewProcFlow, a tool to automatically extract metadata from
video. This metadata is obtained by applying several video
processing methods: cut detection, face detection, image
descriptors for object identification, and semantic concepts to
catalog the video content. The goal is to supply the system with
more information to give a better understanding of the content and
also to enable better browsing and searching functionalities. This tool
integrates a set of algorithms and user interfaces that are
introduced in the workflow of a video production company. This
paper presents an overview of the workflow and describes how
the automatic metadata framework is integrated into this workflow.
Categories and Subject Descriptors
H.5.1 [Multimedia Information Systems]: Video; H.3.3
[Information Storage and Retrieval]: Information Search and
Retrieval – Information filtering.
General Terms
Algorithms, Design, Human Factors.
Keywords
Video Production; Metadata; Video Segmentation; Object Identification
1. INTRODUCTION
With the global access to high-speed Internet, the content
available, which was previously mainly textual, now has a very
strong multimedia component. This makes television networks
and other companies that produce content rethink the way they
work, with the goal of providing more and better content in a fast
and convenient way. One possible contribution to achieve this is
to reuse old material. By increasing the amount of reused material and
reducing the time spent capturing footage that is already available, the
workflow processes of those companies can be sped up (see
Figure 1).
Figure 1 - Content Creation Workflow.
The overall process of obtaining media from the initial production
concepts until the archiving phase can be time consuming.
In particular, the capturing and editing stages are the tasks
with the largest impact on the workflow duration. A more
efficient workflow can provide better management of the
available manpower and reduce the overall costs. For this
purpose, tools are needed to automate the different tasks that compose the
workflow, in particular the most time-consuming ones.
Currently, most of the information extracted from videos is
obtained manually by adding annotations. This is a hard and
tedious job, which additionally introduces the problem of user
subjectivity. Therefore, there is a need for tools to create relevant
semantic metadata in order to provide ways to better navigate and
search the video archives. These processes should be automatic
whenever possible and should require only minimal human
supervision to improve the extraction performance.
This paper describes our proposals for content annotation to be
included in the workflow (Edition and Capture blocks of Figure 1)
of a multimedia content provider company. Our proposal analyzes
the audiovisual information to extract metadata like scenes, faces
and concepts that will give a better understanding of its content.
This paper is focused on the visual part but we also use the audio
information: Speech-to-Text methods for subtitle generation;
automatic recognition of the best audio source from the footage;
and sound environment detection. These audio features are
combined with the visual information to better identify concepts.
The paper is structured as follows. The next section presents an
overview of the VideoFlow project where these questions are
addressed. Section 3 introduces the tools for metadata extraction from
image content. Section 4 presents the user interfaces to access
video content enriched with semantic metadata. Finally, we
present conclusions and directions for further development.
2. VIDEOFLOW PROJECT
The work described in this paper is being developed in the scope
of the VideoFlow Project. This project aims at extending the
Duvideo workflow with tools for the extraction and visualization
of metadata. Duvideo is a multimedia content provider company,
partner of the VideoFlow project. The project also includes the
development of several search and browse interfaces to reuse the
extracted metadata in order to improve the production processes.
This metadata is extracted from the visual (scenes, faces,
concepts) and audio (claps, sounds of instruments, animal sounds,
etc.) parts of a video.
2.1 VideoFlow Workflow
Figure 2 presents the workflow proposed to Duvideo, including
our system (ViewProcFlow) for media annotation and
visualization. It starts with the inclusion of videos into the archive,
which is composed of two different types of videos: HD (high
quality source) and Proxy (low quality source).
Figure 2 - VideoFlow Architecture.
When a video is on a tape format, a conversion to digital is needed
so that it can be added to the archive. This step occurs for old
video material, which still is a large segment of the archive.
The new footage is already recorded on a digital format, and MXF
(Material eXchange Format) [1] is the current format that the
cameras use. MXF is a standard format in production
environments, alongside AAF (Advanced Authoring Format), and
it wraps the essence (video, audio) with metadata. Both
formats incorporate metadata to describe the essence, but AAF is
more suited for postproduction and MXF for the final stages of
production. The MXF format is structured in KLV (Key-Length-Value) segments (see Figure 3).
Figure 3 - MXF, File Organization.
The description of the essence and its synchronization is controlled by three packages (see Figure 4).
Figure 4 - MXF, Connections Between Packages.
The Material Package represents the final timeline of the file after
post edition. The File Package incorporates the essence without any
kind of edition, while the Source Package includes all the EDLs
(Edit Decision Lists) created.
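To make the KLV structure concrete, the following is a minimal C++ sketch, not part of the ViewProcFlow implementation, that walks the top-level KLV triplets of an MXF file: a 16-byte SMPTE Universal Label key, a BER-encoded length, then the value. A real system would use a dedicated MXF library; error handling and run-in detection are omitted here.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>

// Walk the top-level Key-Length-Value triplets of an MXF file (sketch only).
int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    unsigned char key[16];

    while (in.read(reinterpret_cast<char*>(key), 16)) {
        // BER length: a single byte below 0x80, otherwise 0x80 | n followed
        // by n bytes holding the length, most significant byte first.
        int first = in.get();
        if (first == EOF) break;
        std::uint64_t length = 0;
        if (first < 0x80) {
            length = static_cast<std::uint64_t>(first);
        } else {
            int nBytes = first & 0x7F;
            for (int i = 0; i < nBytes; ++i)
                length = (length << 8) | static_cast<unsigned char>(in.get());
        }

        std::printf("key %02x%02x%02x%02x... value of %llu bytes\n",
                    key[0], key[1], key[2], key[3],
                    static_cast<unsigned long long>(length));

        in.seekg(static_cast<std::streamoff>(length), std::ios::cur);  // skip the value
    }
    return 0;
}
```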
Users (journalists, screenwriters, producers and directors) use two
main applications that work with these MXF files: PDZ-1 [2] and
ScheduALL [3]. The PDZ-1 application is used to create the
content for a program by defining stories with the edited footage.
This application will be complemented or replaced to include the
new search and browse features. The ScheduALL application
manages annotations that are used to describe the video content
and will be maintained on the proposed workflow. These
programs work in the Proxy versions of the archive, generating
data that will be used to create the final product with HD content,
in the Unity component.
As mentioned, ViewProcFlow will substitute PDZ-1 and
synchronize data with ScheduALL regarding content annotation.
We took the same approach, using the Proxy videos for all the
processing and creating metadata to be used with HD videos. Our
approach was to first concentrate on various technologies for
metadata extraction. We left for a later development stage the
final format (MXF) generation, because all the technologies work
at the pixel level, which can be accessed in the different video
formats.
2.2 ViewProcFlow Architecture
The proposed system splits into a Server-side and a Web Client-side.
The Server-side does the video processing and deals with the
requests from the Web Client-side. The Client-side is used to view
the video archive and all the metadata associated with it (see
Figure 5). It will also provide mechanisms to validate the extracted
content. With the Web Client, users can access the system
wherever they are, not being restricted to a local area or specific
software, as they will only need an Internet connection and a
browser.
The Server-side was implemented using C++ and
openFrameworks [4] (a framework that provides easy API binding
to access video content and also creates an abstraction layer that
makes the application independent from the operating system).
The Client-side was developed with Flex 3.0. This technology
provides an easy way to prototype the interface for the final user
without jeopardizing the expected functionality.
Figure 5 - Client and Server Side (ViewProcFlow).
The next two sections explain the video processing tasks performed in the server (Section 3) and the client-side Web user interface used to search and browse media content (Section 4).
3. VIDEO PROCESSING
As mentioned in Section 2, a great portion of the video archive is
still on tape format, which means that all the clips from one tape,
when converted to digital, are joined in one final clip. For that
reason, segmentation is essential to extract the scenes from the
clip and get a better representation of it. All the metadata gathered
is stored in XML (eXtensible Markup Language) files, because it
is the standard format for data sharing and also helps the
integration with the Web Client.
3.1 Video Segmentation
To accomplish the segmentation of the video, we used a simple
difference of histograms [5] to detect the scenes of the video. Once
the scenes are detected, one frame is chosen to identify each of them.
The middle frame of the shot is selected to represent the whole scene.
This criterion was used to avoid noise (regarding semantic content)
that happens in the beginning and at the end of a scene (see Figure 6).
Normally this occurs when there are effects between scenes – e.g.,
fade or dissolve. The frames obtained in this way are the input of
the following techniques.
Figure 6 - Frames With Noise.
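As an illustration of this segmentation step, the sketch below detects cuts by comparing histograms of consecutive frames and picks the middle frame of each shot as its keyframe. It uses OpenCV rather than the openFrameworks pipeline of ViewProcFlow, and the Bhattacharyya comparison and the 0.4 threshold are illustrative choices, not values taken from the paper.

```cpp
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

// Normalized hue/saturation histogram of one frame.
static cv::Mat frameHistogram(const cv::Mat& frameBgr) {
    cv::Mat hsv, hist;
    cv::cvtColor(frameBgr, hsv, cv::COLOR_BGR2HSV);
    int channels[] = {0, 1};
    int histSize[] = {32, 32};
    float hRange[] = {0, 180}, sRange[] = {0, 256};
    const float* ranges[] = {hRange, sRange};
    cv::calcHist(&hsv, 1, channels, cv::Mat(), hist, 2, histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);
    return hist;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    cv::VideoCapture cap(argv[1]);          // proxy video file
    if (!cap.isOpened()) return 1;

    const double cutThreshold = 0.4;        // tuning parameter, not a value from the paper
    std::vector<int> cuts{0};
    cv::Mat frame, prevHist;
    int index = 0;

    while (cap.read(frame)) {
        cv::Mat hist = frameHistogram(frame);
        // A large distance between consecutive histograms is taken as a cut.
        if (!prevHist.empty() &&
            cv::compareHist(prevHist, hist, cv::HISTCMP_BHATTACHARYYA) > cutThreshold)
            cuts.push_back(index);
        prevHist = hist;
        ++index;
    }
    cuts.push_back(index);

    // The middle frame of each detected shot is kept as its representative
    // keyframe, avoiding the noisy frames near the shot boundaries.
    for (std::size_t i = 0; i + 1 < cuts.size(); ++i)
        std::printf("shot %zu: frames %d-%d, keyframe %d\n",
                    i, cuts[i], cuts[i + 1] - 1, (cuts[i] + cuts[i + 1]) / 2);
    return 0;
}
```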
3.2 Face Detection
Faces are pervasive in video content and provide preliminary
indexing. We integrated the Viola and Jones [6] algorithm to
detect faces that appear in images. It works with Integral Images,
which allow the algorithm to compute convolution filters over
image areas very quickly.
The Viola and Jones algorithm is based on a set of cascades of
classifiers, previously trained, that are applied to image regions.
This algorithm has some limitations; for instance, it does not
detect partial faces or faces in a profile view. It also produces
some false positives. To overcome these problems, the user will be
included in the process in order to eliminate them.
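A minimal example of this step, using the Viola-Jones detector available in OpenCV; the cascade file and the detection parameters are assumptions for illustration, not the project's actual configuration.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 3) return 1;

    // argv[1]: pre-trained frontal-face Haar cascade shipped with OpenCV
    // (e.g. haarcascade_frontalface_default.xml); argv[2]: a keyframe image.
    cv::CascadeClassifier faceCascade(argv[1]);
    cv::Mat keyframe = cv::imread(argv[2]);
    if (faceCascade.empty() || keyframe.empty()) return 1;

    cv::Mat gray;
    cv::cvtColor(keyframe, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    // Cascade of boosted classifiers evaluated over a sliding window at
    // several scales; integral images make each stage cheap to evaluate.
    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(gray, faces,
                                 1.1,                  // scale step between window sizes
                                 3,                    // minimum neighboring detections to keep
                                 0, cv::Size(40, 40)); // smallest face considered

    for (const cv::Rect& r : faces)
        std::printf("face at (%d, %d), size %dx%d\n", r.x, r.y, r.width, r.height);
    return 0;
}
```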
3.3 Image Descriptors
For video access, some queries require the comparison of
images or image regions. Our proposal uses the information
extracted with the Scale-Invariant Feature Transform (SIFT) [7]
and with the Speeded Up Robust Features (SURF) [8] to compare
images.
These algorithms find keypoints on images that are invariant to
scale and rotation and extract the descriptor that represents the
area around that keypoint. This descriptor is used for matching
purposes between images. In Figure 7, the red dots identify the
detected keypoints and the blue lines are drawn between those that
match.
Figure 7 - Example of Matching Images With Extracted Keypoints.
3.3.1 SIFT
The keypoints are detected using a difference-of-Gaussians filter
applied to the image. The next step computes the gradient at each
pixel in a region of 16x16 pixels around each keypoint. For each
4x4 pixel block of the region, a histogram is calculated considering
8 directions of the gradient. The descriptor is created from these
histograms (a vector with 128 values).
3.3.2 SURF
This method uses smaller regions than the SIFT method to create
the feature descriptor. It also uses the Integral Images technique
[6] in the process to increase the performance of the algorithm.
These regions are 8x8, centered on the keypoint, and are divided
into 4x4 blocks, to which a Haar wavelet is applied. The results for
each sub-region are added to create a descriptor composed of 64
values.
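The matching described above can be sketched as follows with OpenCV. SIFT is used here because it ships with recent OpenCV builds, while SURF would require the opencv_contrib xfeatures2d module; the ratio-test threshold is an illustrative value.

```cpp
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 3) return 1;
    cv::Mat img1 = cv::imread(argv[1], cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread(argv[2], cv::IMREAD_GRAYSCALE);
    if (img1.empty() || img2.empty()) return 1;

    // SIFT detector/descriptor (128-value descriptors); SURF (64 values)
    // could be swapped in the same way.
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    sift->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    // Brute-force matching with Lowe's ratio test to keep distinctive matches.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);

    int good = 0;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            ++good;

    std::printf("%zu and %zu keypoints, %d good matches\n",
                kp1.size(), kp2.size(), good);
    return 0;
}
```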
Table 2 - Some Examples of Concepts Matched With Thesaurus Categories

Concepts | Thesaurus Category
Car, Bicycle, Vehicle, Train | 4816 – Land Transport
Airplane | 4826 – Air and Space Transport
Nature, Plants, Flowers | 5211 – Natural Environment
Trees | 5636 – Forestry
Partylife | 2821 – Social Affairs
Church | 2831 – Culture and Religion
Food | 60 – Agri-Foodstuffs
Fish | 5641 – Fisheries
Baby, Adult, Child, Teenager, Old Person | 2806 – Family; 2816 – Demography and Population
Mountains, River | 72 – Geography
3.4 Semantic Concepts
With the work described in [9], it is possible to browse a personal
library of photos based on different concepts (see Table 1).
Photos were classified and could be accessed in several client
applications.
Table 1 - Trained Concepts
Indoor, Snow, Beach, Nature, Face, Party, People
Our proposal is based on the Regularized Least Squares (RLS) [9]
classifier that performs binary classification on the database (e.g.,
Indoor versus Outdoor or People versus No People). It also uses a
sigmoid function to convert the output of the classifier into a
pseudo-probability. Each concept is trained using a training set
composed of manually labeled images with and without the
concept. After estimating the parameters of the classifier (that is,
after training), the classifier is able to label new images.
Each image is represented by visual features, which are
automatically extracted. We used the Marginal HSV color
Moments [9] and the features obtained by applying a bank of
Gabor Filters [9] as image representations. Using these classifiers,
the tool was capable of executing interesting queries like “Beach
with People” or “Indoor without People”.
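The following sketch illustrates the idea of a regularized least squares concept classifier with a sigmoid output, using Eigen. It is a simplified linear formulation for illustration only and does not reproduce the exact training procedure or the color-moment and Gabor features of [9]; the toy data is made up.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <cstdio>

// Regularized least squares for binary concept classification.
// X holds one feature vector per row; y holds +1 for images with the
// concept and -1 for images without it.
Eigen::VectorXd trainRLS(const Eigen::MatrixXd& X, const Eigen::VectorXd& y,
                         double lambda) {
    const int d = static_cast<int>(X.cols());
    // w = (X^T X + lambda * I)^-1 X^T y
    Eigen::MatrixXd A = X.transpose() * X
                      + lambda * Eigen::MatrixXd::Identity(d, d);
    return A.ldlt().solve(X.transpose() * y);
}

// The sigmoid turns the raw classifier score into a pseudo-probability of
// the concept being present.
double conceptProbability(const Eigen::VectorXd& w, const Eigen::VectorXd& features) {
    return 1.0 / (1.0 + std::exp(-w.dot(features)));
}

int main() {
    // Toy example: 4 labeled images with 3-dimensional features.
    Eigen::MatrixXd X(4, 3);
    X << 0.9, 0.1, 0.2,
         0.8, 0.2, 0.1,
         0.1, 0.9, 0.7,
         0.2, 0.8, 0.9;
    Eigen::VectorXd y(4);
    y << 1, 1, -1, -1;            // +1 = concept present, -1 = absent

    Eigen::VectorXd w = trainRLS(X, y, 0.1);
    Eigen::VectorXd query(3);
    query << 0.85, 0.15, 0.2;
    std::printf("P(concept) = %.3f\n", conceptProbability(w, query));
    return 0;
}
```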
Duvideo usually uses a set of categories to access the archives.
Table 2 presents a subset of the Thesaurus used by Duvideo and a
set of related concepts obtained from the list of concepts used in
ImageCLEF [10] for submissions on “Visual Concept Detection
and Annotation Task”. Currently, we have trained a subset of the
concepts presented in Table 1 and we plan to train most of the
ones in Table 2.
For several thesaurus categories it is difficult to identify visual
features, due to the abstraction level of the subject – e.g., “Rights
and Freedoms” or “Legal form of Organizations”. We will overcome
these difficulties by using ontologies to look for categories that can
be connected with several individual concepts.
We also plan to use human intervention for the abstract categories
in a semi-automatic process, which requires appropriate user
interfaces, as described next.
4. USER INTERFACES
In order to make the most of the metadata produced, the interface
for its visualization and use is a key part of the system.
Preliminary specifications were made based on input from the
potential users. The interface can be divided into two main groups
of functionalities: browsing and searching features. The interface
starts with an overall view of the whole video archive on the right
side and the main search parameters on the left (see Figure 8).
Figure 8 - Web Client, Home Page.
The following options are available to the user to specify the search:
- Date: “Before” a specific date, “After” a specific date or “Between” two dates.
- Status: “Valid Videos” or “Videos to Validate”. Once the videos are processed on the server, they are labeled as “Videos to Validate”; after the user approves the metadata, the video is marked as “Valid”.
- Thesaurus: a set of categories to identify the context of the video, such as Science and Technology, Art, Sports and History, among others, as presented above.
- Concepts: a second set of options to identify concepts such as “Indoor”, “Outdoor”, “Nature” and “People”.
- Image: the possibility of conducting an image-based search.
- Text Input: searches the annotations, titles and all textual data stored with the video.
If the user wants to add an image to the search, it is possible to
use one that is already in the library of images or to upload one to
the server. However, sometimes the chosen image has more
elements than the user wants, so we provide a simple editor
allowing the user to select the area of interest (see Figure 9),
either a rectangle or a circle. Since the image search is based on
the SURF descriptors, the user sees all the detected keypoints,
which helps to avoid choosing areas with few or no descriptors.
Also, to make the results more flexible, the user can specify the
minimum percentage of keypoints for a scene to be considered a
hit – i.e., with a lower percentage and the area selected in Figure 9,
a scene that only includes the eye or the mouth of the child would
be added to the result, since a smaller fraction of keypoints in the
selected area would be required for the scene to be considered a hit.
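This hit criterion can be expressed as a small helper, sketched below; the function name, the per-keypoint match flags and the percentage parameter are illustrative, while the real system takes the equivalent decision over SURF descriptor matches.

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Decide whether a scene is a "hit" for a region-based image query: the
// fraction of query keypoints inside the selected area that were matched in
// the scene must reach the user-chosen minimum percentage.
bool sceneIsHit(const std::vector<cv::KeyPoint>& queryKeypoints,
                const std::vector<bool>& matchedInScene,  // one flag per query keypoint
                const cv::Rect2f& selectedArea,           // rectangle drawn in the editor
                double minMatchPercentage) {
    int inArea = 0, matched = 0;
    for (std::size_t i = 0; i < queryKeypoints.size(); ++i) {
        if (!selectedArea.contains(queryKeypoints[i].pt)) continue;
        ++inArea;
        if (matchedInScene[i]) ++matched;
    }
    if (inArea == 0) return false;  // an area with no descriptors cannot match
    return 100.0 * matched / inArea >= minMatchPercentage;
}
```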
Figure 9 - Simple Image Editor.
When the search is performed, the library view is updated with the
current results. A popup window appears once the user selects one
of the videos from the results. It contains all the scenes that got a
hit from the search, creating a shortcut to each of them and thus
facilitating access.
When a video is selected, a visualization screen (see Figure 10)
allows the user to observe the video and all the extracted metadata.
Figure 10 - Video Visualization.
The extracted metadata (faces, scenes, concepts) is organized in
“Timelines”: when one type is selected, all the corresponding data
appears and functions as anchors to its position in the video. In
order to give a better perspective of where the data occurs, marks
are added to the video timecode.
Since the automatic algorithms introduce some errors, we provide
validation options for the extracted metadata. For localized mistakes
– e.g., a single wrongly detected scene – it is possible to remove it;
if the issue is more general, the user can send the video to be
reprocessed, parameterizing the segmentation algorithm or choosing
a different one.
5. CONCLUSIONS AND FUTURE WORK
The paper presents the first version of our proposal to integrate
different technologies for metadata extraction to be included in
the workflow of a video production company. The process to
automatically detect scenes on a video is an important feature
because of the large amount of unsegmented videos on tape
format. Even if the process is not perfect and it requires some
minor human corrections, it will save manpower and time.
Regarding the semantic concepts, we will work with Duvideo’s
thesaurus, which is being built based on the EUROVOC standard
[11]. Currently it has 21 domains with 127 micro-thesauri
described. When this proposal is finished, we can assess what
kind of concepts can be trained to automatically catalog and
search the content, either by training a specific one or by
composing several concepts.
There are also some features that would enrich the current version
of the application, such as:
- Face recognition: adding an automatic process to identify people in videos would bring new functionalities to the prototype. A first step could be gender classification, a technique described in [12].
- The possibility of doing some level of video editing and creating new stories by cutting and joining scenes.
Regarding the existing tasks, increasing the performance (see
Table 3) with the introduction of parallel computing could lead to
better results. In a similar way, the usage of a native XML
database, like Sedna [13], will help in accessing data and
executing queries over the textual parameters.
Table 3 – Task Durations

Task | Avg. Duration
Histogram Difference Between Two Images | 0.17 s
Face Detection | 0.02 s
SURF Descriptors Extraction | 0.94 s
Match Two Images (With 400 Descriptors Each) | 0.21 s
Finally, the full integration of the prototype with the MXF format
is a key aspect to be handled next [14]. Interactions between the
development team and the users in the production company are
also a vital component for the success of the project, and this will
be carried out through informal exchanges and more formal user tests
based on questionnaires and task analysis.
6. ACKNOWLEDGMENTS
This work was partially funded in the scope of Project VideoFlow
(003096), “Iniciativa QREN, do financiamento UE/FEDER,
através do COMPETE – Programa Operacional Factores de
Competitividade”.
Our thanks to Duvideo for all the help and cooperation throughout
the project's life cycle, from requirements elicitation to the
validation of the prototype.
7. REFERENCES
[1] Devlin, B., Wilkinson, J., Beard, M., and Tudor, P. The MXF
Book: Introduction to the Material eXchange Format.
Elsevier, March 2006.
[2] Sony. PDZ-1 Optical XDCAM Proxy Browsing Software.
https://servicesplus.us.sony.biz/sony-software-modelPDZ1.aspx, June 2010.
[3] ScheduALL. http://www.scheduall.com, June 2010.
[4] Lieberman, Z., Watson, T., and Castro, A. openFrameworks.
http://www.openframeworks.cc/, October 2009.
[5] Dailianas, A., Allen, R. B., and England, P. Comparison of
automatic video segmentation algorithms. In SPIE
Photonics West (1995), pp. 2–16.
[6] Viola, P., and Jones, M. Robust real-time object detection.
International Journal of Computer Vision (2001).
[7] Lowe, D. G. Distinctive image features from scale-invariant
keypoints. International Journal of Computer Vision 60, 2
(2004), 91–110.
[8] Bay, H., Tuytelaars, T., and Gool, L. J. V. SURF: Speeded up
robust features. In ECCV (1) (2006), pp. 404–417.
[9] Jesus, R. Recuperação de Informação Multimédia em
Memórias Pessoais. PhD thesis, Universidade Nova de
Lisboa, Faculdade de Ciências e Tecnologias, September
2009.
[10] ImageCLEF, Image Retrieval in CLEF http://www.imageclef.org/2010, June 2010.
[11] EUROVOC - Thesaurus. http://europa.eu/eurovoc/, June
2010.
[12] Grangeiro, F., Jesus, R., and Correia, N. Face recognition
and gender classification in personal memories. In ICASSP
’09: Proceedings of the 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing (Washington,
DC, USA, 2009), IEEE Computer Society, pp. 1945–1948.
[13] MODIS, sedna - Native XML Database System.
http://modis.ispras.ru/sedna/, May 2010.
[14] XML Schema for MXF Metadata. Tech. rep., MOG
Solutions, February 2005.