Media Annotation in the Video Production Workflow

ABSTRACT
Content providers and television networks hold large video archives but usually do not take full advantage of them. To assist the exploration of video archives, we developed ViewProcFlow, a tool that automatically extracts metadata from video. This metadata is obtained by applying several video processing methods: cut detection, face detection, image descriptors for object identification, and semantic concepts to catalog the video content. The goal is to supply the system with more information, giving a better understanding of the content and enabling better browse and search functionalities. The tool integrates a set of algorithms and user interfaces that are introduced into the workflow of a video production company. This paper presents an overview of that workflow and describes how the automatic metadata framework is integrated into it.

Categories and Subject Descriptors
H.5.1 [Multimedia Information Systems]: Video; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering.

General Terms
Algorithms, Design, Human Factors.

Keywords
Video Production, Metadata, Video Segmentation, Object Identification

1. INTRODUCTION
With global access to high-speed Internet, the available content, which was previously mainly textual, now has a very strong multimedia component. This makes television networks and other content producers rethink the way they work, with the goal of providing more and better content in a fast and convenient way. One possible contribution to this goal is to reuse old material. By reusing more material and spending less time capturing footage that is already available, the workflow processes of those companies can be sped up (see Figure 1).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’10, Month 1–2, 2010, City, State, Country. Copyright 2010 ACM 1-58113-000-0/00/0010…$10.00.

Figure 1 - Content Creation Workflow.

The overall process of obtaining media, from the initial production concepts to the archiving phase, can be time consuming. The capturing and editing stages, however, are the tasks with the greatest impact on the workflow duration. A more efficient workflow allows better management of the available manpower and reduces the overall costs. For this purpose, tools are needed to automate the different tasks that compose the workflow, in particular the most time-consuming ones.

Currently, most of the information extracted from videos is obtained manually by adding annotations. This is hard and tedious work, which additionally introduces the problem of user subjectivity. There is therefore a need for tools that create relevant semantic metadata in order to provide better ways to navigate and search the video archives. These processes should be automatic whenever possible and should need only a minimum of human supervision to increase the extraction performance.

This paper describes our proposals for content annotation to be included in the workflow (Edition and Capture blocks of Figure 1) of a multimedia content provider company. Our proposal analyzes the audiovisual information to extract metadata such as scenes, faces and concepts that give a better understanding of its content.
This paper focuses on the visual part, but we also use the audio information: speech-to-text methods for subtitle generation, automatic recognition of the best audio source from the footage, and sound environment detection. These audio features are combined with the visual information to better identify concepts.

The paper is structured as follows. The next section presents an overview of the VideoFlow project, where these questions are addressed. Section 3 introduces the tools for metadata extraction from image content. Section 4 presents the user interfaces to access video content enriched with semantic metadata. Finally, we present conclusions and directions for further development.

2. VIDEOFLOW PROJECT
The work described in this paper is being developed in the scope of the VideoFlow project. This project aims at extending the Duvideo workflow with tools for the extraction and visualization of metadata. Duvideo, a multimedia content provider company, is a partner of the VideoFlow project. The project also includes the development of several search and browse interfaces that reuse the extracted metadata in order to improve the production processes. This metadata is extracted from the visual (scenes, faces, concepts) and audio (claps, instrument sounds, animal sounds, etc.) parts of a video.

2.1 VideoFlow Workflow
Figure 2 presents the workflow proposed to Duvideo, including our system (ViewProcFlow) for media annotation and visualization. It starts with the inclusion of videos in the archive, which holds two types of video: HD (high-quality source) and Proxy (low-quality source).

Figure 2 - VideoFlow Architecture.

When a video is on tape, a conversion to digital is needed so that it can be added to the archive. This step applies to old video material, which still makes up a large segment of the archive. New footage is already recorded in a digital format, and MXF (Material eXchange Format) [1] is the format currently used by the cameras.
MXF is a standard format in production environments, alongside AAF (Advanced Authoring Format), and it wraps the essence (video, audio) with metadata. Both formats incorporate metadata to describe the essence, but AAF is more suited to postproduction and MXF to the final stages of production. The MXF format is structured in KLV (Key-Length-Value) segments (see Figure 3).

Figure 3 - MXF, File Organization.

The description of the essence and its synchronization is controlled by three packages (see Figure 4).

Figure 4 - MXF, Connections Between Packages.

The Material Package represents the final timeline of the file after editing. The File Package incorporates the essence without any kind of editing, while the Source Package includes all the EDLs (Edit Decision Lists) created.

Users (journalists, screenwriters, producers and directors) use two main applications that work with these MXF files: PDZ-1 [2] and ScheduALL [3]. The PDZ-1 application is used to create the content for a program by defining stories with the edited footage. This application will be complemented or replaced to include the new search and browse features. The ScheduALL application manages the annotations used to describe the video content and will be kept in the proposed workflow. These programs work on the Proxy versions of the archive, generating data that will be used to create the final product with HD content, in the Unity component.

As mentioned, ViewProcFlow will substitute PDZ-1 and synchronize data with ScheduALL regarding content annotation. We took the same approach, using the Proxy videos for all the processing and creating metadata to be used with the HD videos. Our approach was to concentrate first on the various technologies for metadata extraction. We left the generation of the final format (MXF) for a later development stage, because all the technologies work at the pixel level, which can be accessed in the different video formats.

2.2 ViewProcFlow Architecture
The proposed system is split into a server side and a Web client side. The server side performs the video processing and handles the requests from the Web client. The client side is used to view the video archive and all the metadata associated with it (see Figure 5). It will also provide mechanisms to validate the extracted content. With the Web client, users can access the system wherever they are, without being restricted to a local area or specific software, as they only need an Internet connection and a browser.

Figure 5 - Client and Server Side (ViewProcFlow).

The server side was implemented in C++ with openFrameworks [4], a framework that provides an easy API binding to access video content and also creates an abstraction layer that makes the application independent of the operating system. The client side was developed with Flex 3.0. This technology provides an easy way to prototype the interface for the final user without jeopardizing the expected functionality.

The next two sections explain the video processing tasks performed on the server (Section 3) and the client side, with the Web user interface used to search and browse media content (Section 4).

3. VIDEO PROCESSING
As mentioned in Section 2, a large portion of the video archive is still on tape, which means that all the clips from one tape, when converted to digital, are joined into one final clip. Segmentation is therefore essential to extract the scenes from the clip and obtain a better representation of it. All the metadata gathered is stored in XML (eXtensible Markup Language) files, because XML is a standard format for sharing data and it also helps the integration with the Web client.

3.1 Video Segmentation
To segment the video, we used a simple difference of histograms [5] to detect the scenes. Once the scenes are detected, one frame is chosen to identify each of them: the middle frame of the shot is selected to represent the whole scene. This criterion was used to avoid the noise (regarding semantic content) that occurs at the beginning and at the end of a scene (see Figure 6). Normally this happens when there are effects between scenes, e.g., a fade or a dissolve. The frames obtained in this way are the input of the following techniques.

Figure 6 - Frames With Noise.

3.2 Face Detection
Faces are pervasive in video content and provide preliminary indexing. We integrated the Viola and Jones [6] algorithm to detect faces that appear in images. It works with Integral Images, which allow the algorithm to compute convolution filters over image areas very quickly. The Viola and Jones algorithm is based on a set of cascades of previously trained classifiers that are applied to image regions. This algorithm has some limitations; for instance, it does not detect partial faces or faces in profile view. It also produces some false positives. To overcome these problems, the user will be included in the process in order to eliminate the incorrect detections.

3.3 Image Descriptors
For video access, some queries require the comparison of images or image regions. Our proposal uses the information extracted with the Scale-Invariant Feature Transform (SIFT) [7] and with Speeded Up Robust Features (SURF) [8] to compare images. These algorithms find keypoints in images that are invariant to scale and rotation and extract a descriptor that represents the area around each keypoint. This descriptor is used for matching between images. In Figure 7, the red dots identify the detected keypoints and the blue lines are drawn between those that match.

Figure 7 - Example of Matching Images With Extracted Keypoints.

3.3.1 SIFT
The keypoints are detected using a difference-of-Gaussians filter applied to the image. The next step computes the gradient at each pixel in a region of 16x16 pixels around each keypoint. For each 4x4 pixel block of the region, a histogram is calculated considering 8 directions of the gradient.
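The block layout just described (a 16x16 region split into 4x4 pixel blocks, each contributing an 8-direction gradient histogram) can be sketched in a few lines of Python. This is only an illustration of the layout, not the full method of [7]: it assumes a plain 16x16 grayscale patch given as nested lists, uses simple central differences for the gradient, and omits the Gaussian weighting, orientation normalization and interpolation that the complete algorithm applies. The function name `sift_like_descriptor` is hypothetical.

```python
import math


def sift_like_descriptor(patch):
    """Return a 128-value vector for a 16x16 patch (16 rows of 16 numbers).

    Simplified sketch: 16 blocks of 4x4 pixels, 8 gradient directions each.
    """
    if len(patch) != 16 or any(len(row) != 16 for row in patch):
        raise ValueError("expected a 16x16 patch")

    def pixel(r, c):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return patch[min(max(r, 0), 15)][min(max(c, 0), 15)]

    descriptor = []
    for block_row in range(0, 16, 4):
        for block_col in range(0, 16, 4):
            hist = [0.0] * 8  # one bin per 45-degree direction
            for r in range(block_row, block_row + 4):
                for c in range(block_col, block_col + 4):
                    gx = pixel(r, c + 1) - pixel(r, c - 1)
                    gy = pixel(r + 1, c) - pixel(r - 1, c)
                    magnitude = math.hypot(gx, gy)
                    angle = math.atan2(gy, gx) % (2 * math.pi)
                    direction = int(angle / (math.pi / 4)) % 8
                    # Each pixel votes for its direction, weighted by magnitude.
                    hist[direction] += magnitude
            descriptor.extend(hist)

    # Normalize to unit length so descriptors are comparable across patches.
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor
```

Calling the function on any 16x16 patch yields 16 blocks x 8 directions = 128 values, which is where the descriptor length comes from.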
The descriptor is created with these histograms (a vector with 128 values).

3.3.2 SURF
This method uses smaller regions than SIFT to create the feature descriptor. It also uses the Integral Images technique [6] to increase the performance of the algorithm. The regions are 8x8, centered on the keypoint, and are divided into 4x4 blocks, to which a Haar wavelet is applied. The results for each sub-region are added to create a descriptor composed of 64 values.

3.4 Semantic Concepts
With the work described in [9], it is possible to browse a personal library of photos based on different concepts (see Table 1). Photos were classified and could be accessed in several client applications.

Table 1 - Trained Concepts
Indoor    Snow
Beach     Nature
Face      Beach
Party     People

Our proposal is based on the Regularized Least Squares (RLS) [9] classifier, which performs binary classification on the database (e.g., Indoor versus Outdoor or People versus No People). It also uses a sigmoid function to convert the output of the classifier into a pseudo-probability. Each concept is trained using a training set composed of manually labeled images with and without the concept. After the parameters of the classifier are estimated (that is, after training), the classifier is able to label new images.

Each image is represented by visual features, which are automatically extracted. We used the Marginal HSV Color Moments [9] and the features obtained by applying a bank of Gabor Filters [9] as image representations. Using these classifiers, the tool was capable of executing interesting queries such as “Beach with People” or “Indoor without People”.

Duvideo usually uses a set of categories to access the archives. Table 2 presents a subset of the thesaurus used by Duvideo and a set of related concepts obtained from the list of concepts used in ImageCLEF [10] for submissions to the “Visual Concept Detection and Annotation Task”. Currently, we have trained a subset of the concepts presented in Table 1 and we plan to train most of the ones in Table 2.

Table 2 - Some Examples of Concepts Matched With Thesaurus Categories
Concepts                                    Thesaurus Category
Car, Bicycle, Vehicle, Train                4816 - Land Transport
Airplane                                    4826 - Air and Space Transport
Nature, Plants, Flowers                     5211 - Natural Environment
Trees                                       5636 - Forestry
Partylife                                   2821 - Social Affairs
Church                                      2831 - Culture and Religion
Food                                        60 - Agri-Foodstuffs
Fish                                        5641 - Fisheries
Baby, Adult, Child, Teenager, Old Person    2806 - Family, 2816 - Demography and Population
Mountains, River                            72 - Geography

For several thesaurus categories, it is difficult to identify features because of the abstraction level of the subject, e.g., “Rights and Freedoms” or “Legal Form of Organizations”. We will overcome these difficulties by using ontologies to look for categories that can be connected with several individual concepts. We also plan to use human intervention for the abstract categories in a semi-automatic process, which requires appropriate user interfaces, as described next.

4. USER INTERFACES
To make the most of the metadata produced, the interface for its visualization and use is a key part of the system. Preliminary specifications were drawn up based on input from potential users. The interface can be divided into two main groups of functionalities: browsing and searching. It starts with an overall view of the whole video archive on the right side and the main search parameters on the left (see Figure 8).

Figure 8 - Web Client, Home Page

The following options are available to the user to specify the search:
Date: “Before” a specific date, “After” a specific date or “Between” two dates.
Status: “Valid Videos” or “Videos to Validate”. Once the videos are processed on the server, they are labeled as “Videos to Validate”. After the user approves the metadata, the video is made “Valid”.
Thesaurus: a set of categories to identify the context of the video, such as Science and Technology, Art, Sports and History, among others, as presented above.
Concepts: a second set of options to identify concepts such as “Indoor”, “Outdoor”, “Nature” and “People”.
Image: the possibility of conducting an image-based search.
Text Input: searches the annotations, titles and all textual data stored with the video.

If the user wants to add an image to the search, it is possible to use one that is already in the library of images or to upload one to the server. However, sometimes the chosen image has more elements than the user wants, so we provide a simple editor that allows the user to select the area of interest (see Figure 9), either a rectangle or a circle. Since the search with an image is based on the SURF descriptors, the user sees all the detected keypoints, which helps to avoid choosing areas with few or no descriptors. Also, to make the results more flexible, the user can specify the minimum percentage of keypoints that must match for a scene to be considered a hit. For example, with a lower percentage and the area selected in Figure 9, a scene that only includes the eye or the mouth of the child would be added to the results, since a smaller fraction of the keypoints in the selected area would be required.

Figure 9 - Simple Image Editor

When the search is performed, the library view is updated with the current results. A popup window appears once the user selects one of the videos from the results. It contains all the scenes that got a hit from the search, creating a shortcut to each of them and thus facilitating access. When a video is selected, a visualization screen (see Figure 10) allows the user to watch the video and inspect all the extracted metadata.

Figure 10 - Video Visualization

The extracted metadata (faces, scenes, concepts) is organized in “Timelines”. When one type is selected, all the corresponding data appears and functions as anchors to its position in the video. To give a better perspective of where the data occurs, marks are added to the video timecode.

Since the automatic algorithms produce some errors, we provide validation options for the extracted metadata. When there is an isolated mistake, e.g., in one scene, it is possible to remove it; if the issue is more general, the user can send the video to be reprocessed, either parameterizing the segmentation algorithm or choosing a different one.

5. CONCLUSIONS AND FUTURE WORK
The paper presents the first version of our proposal to integrate different technologies for metadata extraction into the workflow of a video production company. Automatically detecting scenes in a video is an important feature because of the large amount of unsegmented video on tape. Even if the process is not perfect and requires some minor human corrections, it will save manpower and time.

Regarding the semantic concepts, we will work with Duvideo’s thesaurus, which is being built based on the EUROVOC standard [11]. Currently it has 21 domains with 127 micro-thesauri described. When this proposal is finished, we can assess which kinds of concepts can be trained to automatically catalog and search the content, either by training a specific one or by composing several concepts.

There are also some features that would enrich the current version of the applications, such as:
Face recognition: adding an automatic process to identify persons in videos would bring new functionalities to the prototype. A first step could be gender classification, a technique described in [12].
The possibility of doing some level of video editing and creating new stories by cutting and joining scenes.

Regarding the existing tasks, improving the processing times (see Table 3) through the introduction of parallel computing could lead to better results. Similarly, the use of a native XML database, such as Sedna [13], will help in accessing data and executing queries for the textual parameters.

Table 3 - Tasks Duration
Task                                            Avg. Duration
Histogram Difference Between Two Images         0.17s
Face Detection                                  0.02s
SURF Descriptors Extraction                     0.94s
Match Two Images (With 400 Descriptors Each)    0.21s

Finally, the full integration of the prototype with the MXF format is a key aspect to be handled next [14]. Interactions between the development team and the users in the production company are also a vital component for the success of the project, and these will be carried out through informal exchanges and more formal user tests based on questionnaires and task analysis.

6. ACKNOWLEDGMENTS
This work was partially funded in the scope of Project VideoFlow (003096), “Iniciativa QREN, do financiamento UE/FEDER, através do COMPETE – Programa Operacional Factores de Competitividade”. Our thanks to Duvideo for all the help and cooperation throughout the life cycle of the project, from requirements elicitation to the validation of the prototype.

7. REFERENCES
[1] Devlin, B., Wilkinson, J., Beard, M., and Tudor, P. The MXF Book: Introduction to the Material eXchange Format. Elsevier, March 2006.
[2] Sony. PDZ-1 Optical XDCAM Proxy Browsing Software. https://servicesplus.us.sony.biz/sony-software-modelPDZ1.aspx, June 2010.
[3] ScheduALL. http://www.scheduall.com, June 2010.
[4] Lieberman, Z., Watson, T., and Castro, A. openFrameworks. http://www.openframeworks.cc/, October 2009.
[5] Dailianas, A., Allen, R. B., and England, P. Comparison of automatic video segmentation algorithms. In SPIE Photonics West (1995), pp. 2–16.
[6] Viola, P., and Jones, M. Robust real-time object detection. In International Journal of Computer Vision (2001).
[7] Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[8] Bay, H., Tuytelaars, T., and Gool, L. J. V. SURF: Speeded up robust features. In ECCV (1) (2006), pp. 404–417.
[9] Jesus, R. Recuperação de Informação Multimédia em Memórias Pessoais. PhD thesis, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologias, September 2009.
[10] ImageCLEF, Image Retrieval in CLEF. http://www.imageclef.org/2010, June 2010.
[11] EUROVOC - Thesaurus. http://europa.eu/eurovoc/, June 2010.
[12] Grangeiro, F., Jesus, R., and Correia, N. Face recognition and gender classification in personal memories. In ICASSP ’09: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (Washington, DC, USA, 2009), IEEE Computer Society, pp. 1945–1948.
[13] MODIS. Sedna - Native XML Database System. http://modis.ispras.ru/sedna/, May 2010.
[14] XML Schema for MXF Metadata. Tech. rep., MOG Solutions, February 2005.