SP4: Joint Segmentation and Authentication of Two Modalities (Voice/Face)
Report: Database Protocol
EURECOM - LIA

BIOBIMO RNRT - BIOmétrie BImodale sur MObile (Bimodal Biometrics on Mobile)
http://biobimo.eurecom.fr

Table of Contents
1. Contents
2. Database Population
3. Video Format
4. Audio Format
5. Sessions

The requirements of the RNRT project BioBiMo call for the acquisition of a mini database for testing and demonstrating bimodal algorithms that combine audio and face recognition. This document outlines the acquisition protocol for the database.

1. Contents

The database contents are constrained by the fact that they will be used by both audio-based and face-based recognition systems, so they must exhibit the variability that each system requires.

Blinking
Each recording session will start with a number of blinks (20 blinks per 10 seconds), which are used to localize the eyes and thus the head.

Password Set
The password set consists of 10 passwords, which are predefined words or short sentences of 1 to 3 seconds in duration. The aim of this dataset is to provide data for audio person recognition in a password-based scenario.

Question / Answer Session
This session provides text-independent data for behavioral face recognition. It consists of the answer to one of the 10 predefined questions, recorded for 30 seconds.

Numbers
The person recites the numbers from 1 to 30.

2. Database Population

The database has been divided into 3 parts based on the project scenario and testing protocol.
The parts are as follows.

Client
The first part consists of client data from 20 speakers, with all 10 sessions recorded for this dataset. It is to be used for both training and testing the algorithms. Each session consists of:
  Password Set
  Question / Answer Session
  Numbers

Imposter
The aim of this dataset is to test the algorithms by presenting imposters. It consists of data from 10 speakers, with all 10 sessions recorded for this dataset. Each session consists of:
  Password Set
  Question / Answer Session
  Numbers

World
This dataset is required to create a general model of the world. It will consist of at least 20 persons speaking random sentences for 30 to 40 seconds.

3. Video Format

Camera Settings
Manual camera settings should be avoided: whichever camera model is selected should be able to set the exposure and white balance on its own, and should be left to do so.

Illumination
Illumination is one of the major concerns in visual feature extraction. Bad lighting conditions affect the process in two ways. First, they alter the color composition, which is one of the bases for feature extraction. Second, they can hide features altogether; for example, the shadow of the nose can hide parts of the mouth.

Video Resolution
This specification also plays an important role in feature extraction. If the resolution is too low, it hinders the exact localization of feature points such as the tips of the lips. We therefore propose a minimum resolution of 640 x 480 pixels.

Temporal Resolution
Although the frame rate has no direct effect on feature extraction, a low frame rate can cause loss of data and consequently loss of classification information. We normally work at 25 frames per second.

Distance between eyes
The distance between the eyes defines the size of the face in pixels.
This specification is necessary to avoid a situation in which the video is 640 x 480 pixels but the face occupies only a very small portion of the image, making the video useless. In our study the distance between the eyes should be between 40 and 60 pixels.

Video Compression
Although it is not feasible to avoid video compression entirely, we would prefer no compression. This specification can greatly affect the performance of our system, because compression usually introduces a blocking effect that destroys most of the edge information our system uses.

Video Format
The standard video format that we currently use is "avi", but this is not a major concern, as the format can easily be changed at any time in the future.

Color
Our system currently uses videos with a color depth of 16 bits per pixel. Because the system also uses color for feature extraction, we would like the color depth to remain the same or be higher.

4. Audio Format

Recording is possible with a wide range of microphones; only the sampling frequency must remain constant, at 16 kHz.

5. Sessions

A total of 10 sessions will have to be recorded, to ensure the necessary variability and to collect enough data for behavioral face recognition. The breakdown of the sessions is as follows.

Indoor Session
6 indoor sessions in a semi-controlled office environment with normal office lighting and minimal noise.

Outdoor Session
3 external recording sessions per person, in a corridor or outside the building, where other people may be present or walking behind the speaker, with normal street noise levels. The lighting conditions may vary with the situation, but levels should be consistent with office lighting.

Studio Session
1 studio recording with studio lighting, a controlled background, and as little noise as possible.
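
The numeric requirements of Sections 3 to 5 can be gathered into a small acceptance check for recordings. The following is only a minimal sketch and not part of the project's tooling: the names `Recording`, `check_recording` and `total_sessions` are hypothetical, the measurements are assumed to be supplied by the caller, and treating 25 frames per second as a lower bound is an assumption (the protocol only states that this is the usual working rate).

```python
from dataclasses import dataclass
from typing import Dict, List

# Values taken from the protocol text; all names are hypothetical.
MIN_WIDTH, MIN_HEIGHT = 640, 480   # minimum video resolution in pixels
NOMINAL_FPS = 25                   # usual working frame rate
EYE_DISTANCE_RANGE = (40, 60)      # distance between the eyes in pixels
MIN_COLOR_DEPTH = 16               # bits per pixel
AUDIO_SAMPLE_RATE = 16_000         # Hz, must remain constant

# Session breakdown from Section 5.
SESSION_PLAN: Dict[str, int] = {"indoor": 6, "outdoor": 3, "studio": 1}


@dataclass
class Recording:
    """Measured properties of one recording (supplied by the caller)."""
    width: int
    height: int
    fps: float
    eye_distance_px: float
    color_depth_bits: int
    sample_rate_hz: int


def check_recording(rec: Recording) -> List[str]:
    """Return the list of protocol violations; empty if none were found."""
    problems: List[str] = []
    if rec.width < MIN_WIDTH or rec.height < MIN_HEIGHT:
        problems.append("resolution below 640 x 480 pixels")
    if rec.fps < NOMINAL_FPS:
        # Assumption: the nominal rate is treated as a minimum.
        problems.append("frame rate below 25 frames per second")
    low, high = EYE_DISTANCE_RANGE
    if not low <= rec.eye_distance_px <= high:
        problems.append("eye distance outside 40-60 pixels")
    if rec.color_depth_bits < MIN_COLOR_DEPTH:
        problems.append("color depth below 16 bits per pixel")
    if rec.sample_rate_hz != AUDIO_SAMPLE_RATE:
        problems.append("audio sampling frequency is not 16 kHz")
    return problems


def total_sessions(plan: Dict[str, int] = SESSION_PLAN) -> int:
    """Sum the planned sessions across all recording environments."""
    return sum(plan.values())
```

A recording that meets every requirement yields an empty list from `check_recording`, and `total_sessions()` sums the indoor, outdoor and studio sessions to the 10 sessions the protocol calls for. Illumination and compression are left out because the protocol gives no numeric threshold for them.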