Illinois Group in the Star Challenge
PART I: Visual Data Processing
PART II: Audio Search
Liangliang Cao, Xiaodan Zhuang
University of Illinois at Urbana-Champaign
www.beckman.uiuc.edu

What is the Star Challenge?
• A competition to develop the world's next-generation multimedia search technology
• Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
• A real-world computer vision task that requires large amounts of computation power

But low rewards…
56 teams from 17 countries → Round 1 → 8 teams → Round 2 → 7 teams → Round 3 → 5 teams → Grand Final in Singapore. There are no rewards for the first three rounds, and only one team can win the US$100,000 at the Grand Final.

But we have a team with no fears…
Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao, Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi

Let's go over our experience and stories…

Outline
• Problems of visual retrieval
• Data
• Features
• Algorithms
• Results (first 3 rounds)

3 Audio Retrieval Tasks
• AT1 — Query: an IPA sequence. Target: segments that contain the query IPA sequence, regardless of language.
• AT2 — Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of the spoken language.
• AT3 — No queries. Target: extract all recurrent segments that are at least 1 second in length.
Data set: a 25-hour monolingual database in Round 1; a 13-hour multilingual database in Round 3.
Metrics: mean average precision,
MAP = \frac{1}{L} \sum_{i=1}^{L} \frac{1}{R_i} \sum_{j=1}^{R_i} \frac{j}{r_{i,j}},
where L is the number of queries, R_i is the number of relevant segments for query i, and r_{i,j} is the rank at which the j-th relevant segment is retrieved; and the F-measure averaged over the D detected segments,
F = \frac{1}{D} \sum_{d=1}^{D} \frac{(1+\beta^2)\, P_d R_d}{\beta^2 P_d + R_d}.
Xiaodan will talk about this part…

3 Video Retrieval Tasks
• VT1 — Query: a single image (20 short queries). Target: all visually similar video segments. Criterion: "visually similar".
• VT2 — Query: a short video shot, <10 s (20 long queries). Target: all perceptually similar video segments.
• VT3 — Query: videos with sound (3-10 s), on the order of 10K of them. Target: the category number, i.e., learning the common visual characteristics of each category. Scored by classification accuracy.
Metric: mean average precision, built per query from
AP = \frac{1}{R} \sum_{r=1}^{N} P(r)\,\mathrm{rel}(r), \qquad P(r) = \frac{1}{r} \sum_{i=1}^{r} \mathrm{rel}(i),
where rel(r) is 1 if the r-th returned segment is relevant and 0 otherwise, and R is the number of relevant segments.
Data sets: VT1 has 20 categories, VT2 has 10 categories (multiple labels possible in both), and VT3 has 10 (20) categories, including one "others" category.
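Both retrieval parts are scored with mean average precision. For concreteness, here is a minimal sketch of the metric as defined above, assuming binary relevance judgments; the function and variable names are illustrative and not part of the challenge tooling.

```python
def average_precision(ranked_ids, relevant_ids, depth=None):
    """AP = (1/R) * sum_r P(r) * rel(r), where P(r) is the precision
    among the top r results and rel(r) is 1 iff result r is relevant."""
    relevant = set(relevant_ids)
    if depth is not None:                  # e.g., top 50 or top 20 submitted
        ranked_ids = ranked_ids[:depth]
    hits, precision_sum = 0, 0.0
    for r, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / r      # P(r) * rel(r)
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Averaging AP over all queries gives the MAP figures reported in the performance tables later in the talk.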
The 20 VT1 Categories
100. Not applicable: none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices, including handphone/PDA
104. Flag
105. Electronic chart, e.g., stock charts, airport departure chart
106. TV chart overlay, including graphs, text, PowerPoint style
107. Person using computer, both visible
108. Track and field, sports
109. Company trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Close-up of hand, e.g., using mouse, writing, etc.
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g., mountain, trees, sea, no people
115. Food on dishes, plates
116. Face close-up, occupying about 3/4 of screen, frontal or side
117. Traffic scene, many cars, trucks, road visible
118. Boat/ship, over sea, lake
119. PC webpages, screen of PC visible
120. Airplane

10 Categories for VT2
201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside
205. Large camera movement, tracking an object, person, car, etc.
206. Static or minute camera movement, people walking, legs visible
207. Large camera movement, panning left/right, top/down of a scene
208. Movie ending credits
209. Woman monologue
210. Sports celebratory hug

5 Categories for VT3
• 101. Crowd (>10 people)
• 102. Building with sky as backdrop, clearly visible
• 107. Person using computer, both visible
• 112. Close-up of hand, e.g., using mouse, writing, etc.
• 116. Face close-up, occupying about 3/4 of screen, frontal or side

Video + Audio Tasks in Round 3
1) Audio search (AT1 or AT2): 5 queries are given, either as IPA sequences or as waveforms; participants are required to solve 4.
2) Video search (VT1): 5 queries are given; participants are required to solve 4.
3) Audio + video search (AT1 + VT2): each query combines an IPA sequence/waveform with a video category, and participants must retrieve segments whose sound and video match the given IPA sequence/waveform and video category, respectively. 3 queries are given; participants are required to solve 2.

[Two slides of example images and more samples]

Evaluation Video Data in Round 2
• 31 MPEG videos, ~20 hours
• 17,289 frames for VT1 in total
• 40,994 frames for VT2 in total: 32,508 pseudo key frames and 8,486 real key frames

Evaluation Video Data in Round 3
• Video files: 27 MPEG-1 files (13 hours of video/audio in total)
• Key frames for VT1: 10,580 .jpg files
• Key frames for VT2: 64,546 files in total, including 10,580 true key frames (.jpg) + 53,966 pseudo key frames (.jpg)
• Video resolution: 352×288

Computation Power
• Workstations in IFP: 10 servers, 2-4 CPUs each, 36 CPUs in total
• IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
• CSL clusters: Trusted-ILLIAC, 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks; Monolith, a 128-node cluster with dual 1 GHz Pentium III CPUs and 1.5 GB of RAM per node
• TeraGrid

Time Cost for Video Tasks
• Data decompression: 15 minutes
• Video format conversion: 2 hours
• Video segmentation (for VT2): 40 minutes
• Sound track extraction: 30 minutes
• Feature extraction: global feature 1: 2 hours (C); global feature 2: 2 hours (C); patch-based feature 1: 2 hours (C); patch-based feature 2: 5 hours (Matlab); semantic feature 1: 24 hours (Matlab); semantic feature 2: 3 hours (C); semantic feature 3: 4 hours (C); motion feature 1: 24 hours (Matlab); motion feature 2: 3 hours on Trusted-ILLIAC
• Classifier training: classifier 1: 1 hour (on the IFP cluster, 25 CPUs, Matlab); classifier 2: 20 minutes; classifier 3: less than 10 minutes

Possible Accelerations for Video
• Converting Matlab code to C
• Parallel computing
• GPU acceleration for the patch-based features
• Load time is the major issue: extract all features after a single load

Features for Round 2 - VT1
Image features: SIFT, HOG, GIST, APC, LBP, color, texture, etc., plus a semantic feature.

Features for Round 2 - VT2
• Character detector: Harris corners + morphological operations
• Optical flow: Lucas-Kanade on spatial intensity gradients (see the sketch below)
• Gender recognition: SODA-Boost based
• Motion history image
• Spatial interest points
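The corner and optical-flow features above rest on standard building blocks. As an illustration only (not the team's actual feature code), the sketch below extracts Harris corners and pyramidal Lucas-Kanade flow with OpenCV; the function name and parameter values are assumptions.

```python
import cv2
import numpy as np

def sparse_flow(prev_gray, next_gray, max_corners=200):
    """Track Harris corners from one frame to the next with pyramidal
    Lucas-Kanade; returns matched (old, new) point arrays."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=8,
                                  useHarrisDetector=True, k=0.04)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None,
        winSize=(15, 15), maxLevel=2)
    ok = status.ravel() == 1               # keep successfully tracked points
    return pts[ok].reshape(-1, 2), new_pts[ok].reshape(-1, 2)

# The displacement vectors (new - old) can then be histogrammed over the
# shot to form a motion feature for VT2 category classification.
```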
GUFE: Grand Unified Feature Extractor
• Designed by Dennis
• Collects the features generated by team members into one standard format
• Retrieval by query expansion based on nearest neighbors
• Feature normalization/combination
• Result visualization

Observations
1. Samples under the same category are semantically more similar to each other.
2. The shot boundaries are not well defined.
3. Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).

Algorithms: Query Expansion
Input: a query image and its category number.
0. Preprocessing: compute the matching between the evaluation and the development data.
1. Expand the query image by retrieving all images of the same category from the development data set.
2. Search the evaluation set with the expanded query.
Output: the top 50/20 results. (A sketch follows below, after the GMM-based approach.)

Algorithms: GMM-Based Approach
Motivation: use a GMM to model the distribution of image patches.
1. Train a UBM (Universal Background Model) on patches from all training images.
2. MAP-estimate the distribution of the patches belonging to one image, given the UBM.
3. Compute pairwise image distances based on a patch kernel and within-class covariance normalization.
4. Retrieve images based on the normalized distance.
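Here is a schematic rendering of the query-expansion search, assuming every key frame has already been reduced to a fixed-length feature vector; the plain Euclidean nearest-neighbor scoring stands in for whatever matcher GUFE actually used, and all names are illustrative.

```python
import numpy as np

def expand_and_search(query_vec, query_cat, dev_feats, dev_labels,
                      eval_feats, top_k=50):
    """Query expansion: pool the query with all development images of
    its category, then rank the evaluation set by the distance to the
    nearest member of the expanded query set."""
    expanded = np.vstack([query_vec[None, :],
                          dev_feats[dev_labels == query_cat]])
    # Pairwise Euclidean distances: (num_eval, num_expanded)
    d = np.linalg.norm(eval_feats[:, None, :] - expanded[None, :, :], axis=2)
    score = d.min(axis=1)                  # distance to the closest exemplar
    return np.argsort(score)[:top_k]       # indices of the top-k matches
```

Expanding one query image into dozens of development exemplars is what makes the single-image VT1 task tractable: the expanded set covers much more of the category's visual variability than the query alone.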
VT1 Performance (#2 of 8 teams)
Category → MAP:
• 101. Crowd (>10 people): 0.8419
• 102. Building with sky as backdrop, clearly visible: 0.977
• 103. Mobile devices, including handphone/PDA: 0.028
• 107. Person using computer, both visible: 0.2281
• 109. Company trademark, including billboard, logo: 0.96
• 112. Close-up of hand, e.g., using mouse, writing, etc.: 0.4584
• 113. Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115. Food on dishes, plates: 0.2285
• 116. Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
• 117. Traffic scene, many cars, trucks, road visible: 0.2901

VT2 Performance (#1 of 8 teams)
Category → MAP:
• 202. Talking face with introductory caption: 0.8432
• 206. Static or minute camera movement, people walking, legs visible: 0.0581
• 207. Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208. Movie ending credits: 0.2782
• 209. Woman monologue (Zhen): 0.9756

Performance in Round 3 (#1 of 7 teams)
Task 2 (VT1), estimated MAP at R=20:
• 101. Crowd (>10 people): 0.64
• 102. Building with sky as backdrop, clearly visible: 1
• 107. Person using computer, both visible: 0.7
• 112. Close-up of hand, e.g., using mouse, writing, etc.: 0.527
• 116. Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2), R=20 — video-only (VT2) vs. combined (AT1 + VT2):
• 202. Talking face with introductory caption: 1 (VT2 only) vs. 0.03 (AT1 + VT2)
• 209. Woman monologue: 0.35 (VT2 only) vs. 0.1 (AT1 + VT2)
• 201. People entering door: N/A

We are:
• 2nd in audio search
• 4th in video search
• 2nd in AV search
• 1st overall

Illinois Group in the Star Challenge — Part II: Audio Search
• A general indexing/retrieval approach leveraging speech recognition output lattices
• Experience in a real-world audio retrieval task: the Star Challenge
• Experience in speech retrieval in an unknown language

(Audio) Information Retrieval: Problem Definition
Task description: given a query, find the "most relevant" segments in a database of thousands of audio files and return the top N file IDs. Example query: /k r u: d p r ai s ^ z/ ("CRUDE PRICES").

(Audio) Information Retrieval: "Standard" Methods
Published algorithms:
• EXACT MATCH: segment = argmin_i d(query, segment_i), where d is the string edit distance (sketched below). Fast.
• SUMMARY STATISTICS: segment = argmax_i p(query | segment_i); bag-of-words, with no concept of "sequence". Good for text, e.g., Google, Yahoo.
• TRANSFORM AND INFER: segment = argmax_i E(count(query) | segment_i); word order matters. Flexible, but slow…
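The EXACT MATCH baseline is simply dynamic-programming string edit distance over symbol sequences (e.g., phone strings). A textbook sketch with illustrative names:

```python
def edit_distance(query, segment):
    """Levenshtein distance between two symbol sequences; unit costs
    for insertion, deletion, and substitution."""
    m, n = len(query), len(segment)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                               # deletion
                         cur[j - 1] + 1,                            # insertion
                         prev[j - 1] + (query[i - 1] != segment[j - 1]))
        prev = cur
    return prev[n]

def exact_match(query, segments):
    """EXACT MATCH retrieval: pick the segment minimizing edit distance."""
    return min(range(len(segments)),
               key=lambda i: edit_distance(query, segments[i]))
```

Each distance computation is O(|query| × |segment|), which is why the method is fast per comparison but, like the other two, still has to touch every segment without an index.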
Example: Language-Independent Speech Information Retrieval
Pipeline: voice activity detection → perceptual frequency warping → Gaussian mixtures → likelihood vector b_i = p(observation_t | state_t = i) → inference algorithm: a finite state transducer built from ASR lattices → E(count(query) | observations). Retrieval ranking = E(count(query) | segment observations).

STAR Challenge Audio Retrieval Tasks
• AT1 — Query: an IPA sequence. Target: segments that contain the query IPA sequence, regardless of language.
• AT2 — Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of the spoken language.
Data set: a 25-hour monolingual database in Round 1; a 13-hour multilingual database in Round 3.
• These are genuine retrieval tasks, without pre-defined categories.
• The queries are human speech or IPA sequences, in one or multiple languages.
• A query may match only part of the speech in a returned database segment.
• The returned hits must be ordered, and only the first 50 or 20 are submitted.

[System diagram: audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices → FSA generation → FSM indices → group indices; query → FSM-based query construction (empirical and knowledge-based phone confusion) → FSM-based retrieval over all group indices → retrieval results]

Automatic Speech Recognition
• Language-specific systems perform better than language-independent systems: there is no inter-language mismatch between training and testing (acoustic model, pronunciation model, language model). For English data, for example, we can use all our knowledge of English (training data, pronunciation dictionary, language model) → language-specific (e.g., English) word recognition.
• But with a multilingual database and queries: we might fail to identify which particular language the test data is in; even if we know, we might not have speech recognition models for that language; and we might encounter unseen languages (Tamil, Malay, …). → Language-independent, phone-based recognition instead of word-based recognition.

Summary of corpora of different languages [table]
Recognition results: all languages/datasets are not equal [chart]

Acoustic Model
• Spectral features: 39-dimensional PLP, with cepstral mean/variance normalization per speaker
• Modeling: HMMs with {11, 13, 15, 17}-mixture Gaussians
• Context-dependent, language-dependent modeling: left context - CENTRAL PHONE + right context % language, referred to as language-dependent "triphones"; e.g., the sound /A/ in different contexts and different languages: ^'-A+b%Eng, ^'-A+b'%Eng, >-A+cm%Chinese, …

Acoustic Model: Clustering
Categories of decision-tree questions:
• Right or left context
• Distinctive phone features (manner/place of articulation)
• Language identity
• Lexical stress
• Punctuation mark

Language/Sequence Model: N-gram
P(W_i \mid W_{i-1}, \ldots, W_{i-N+1})
If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted into a particular word sequence, which is modeled by the N-gram. If there is no pronunciation model, the context-dependent phone sequence is modeled directly by the N-gram.

Solutions for English Speech Retrieval: Speech Recognition Frontend
• Features: Perceptual Linear Predictive cepstra + energy, with first- and second-order regression coefficients
• Speech recognizer: a triphone-clustered English speech recognizer (English acoustic model, English dictionary, English language model), with lattice generation using a Viterbi decoder

Using Lattices as the Representation of Speech Data
I. A compact way to represent the numerous alternative hypotheses output by the speech recognizer. A simple example: a lattice with two parallel arcs "a" and "b" encodes "a or b". More complex examples appear in James (1995) and Mangu et al. (1999).
II. A way to enable more robust speech retrieval:
• The single best hypothesis from speech recognition is not reliable enough.
• Lattices can be represented as finite state machines, which can be used directly in speech retrieval and take advantage of general weighted finite-state machine algorithms.
• Robust matching between queries and audio files.

Solutions for English Speech Retrieval: Indexing the Audio Archive
• Input from the ASR frontend: lattices of the database segments, plus the English vocabulary
• Construct log-semiring automata for each segment
• Construct FST-based segment index files
• Combine the segment index files into a few group index files for retrieval (a toy illustration of the idea follows below)
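The actual index follows the weighted finite-state transducer construction of Allauzen et al. (2004), illustrated on the next slide. As a much-simplified, purely conceptual sketch of the underlying idea, the code below computes expected phone n-gram counts from an explicit list of lattice paths with posterior weights and builds an inverted index over them; real lattices are never enumerated path by path, so this is not the production method.

```python
from collections import defaultdict

def index_lattice_paths(paths, max_n=3):
    """paths: list of (phone_tuple, posterior) pairs for one segment.
    Returns expected counts E[count(ngram) | lattice], the quantity a
    factor index stores per segment."""
    expected = defaultdict(float)
    for symbols, posterior in paths:
        for n in range(1, max_n + 1):
            for i in range(len(symbols) - n + 1):
                expected[symbols[i:i + n]] += posterior
    return expected

def build_group_index(segments, max_n=3):
    """segments: dict segment_id -> path list.  Builds an inverted
    index ngram -> [(segment_id, expected_count), ...], so a query
    n-gram retrieves a ranked list of candidate segments directly."""
    index = defaultdict(list)
    for seg_id, paths in segments.items():
        for ngram, count in index_lattice_paths(paths, max_n).items():
            index[ngram].append((seg_id, count))
    return index
```

The payoff of indexing is that retrieval cost no longer scales with archive size per query: the expected counts are precomputed once, and a query only touches the posting lists of its own n-grams.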
Indexing the Audio Archive (Example)
[Figure: speech recognition output lattices for two audio files and the index built from them (Allauzen et al., 2004)]

Solutions for English Speech Retrieval: Making FST-Based Queries
• Queries provided as IPA sequences: build an automaton from the IPA sequence, expand each IPA arc into alternative arcs for "similar" IPA symbols, and build a query FSA incorporating constraints.
• Queries provided as audio waveforms: process with the ASR frontend, build a log-semiring automaton, and build the query FSA.
Query construction uses both empirical and knowledge-based phone confusion.
Example of query expansion (phone sequence to word FSA): /k r u: d p r ai s ^ z/ → "CRUDE PRICES".

Solutions for English Speech Retrieval: Retrieval Using Queries
• Parallel retrieval over all group index files
• Order the retrieved segment IDs, then truncate/format the results
• Fuse results with different precision/recall tradeoffs obtained under different settings

Solutions for Multilingual Speech Retrieval
• ASR frontend: a multilingual word/phone speech recognizer, or a language-independent phone recognizer
• Indexing the audio archive: a multilingual index, or a language-independent index
• Making FST-based queries: multilingual queries, or language-independent queries
• Retrieval using queries: same as in language-specific retrieval

Audio Retrieval: What Is the "Optimal" Model Set?
• Language-specific phones/words
• Language-independent phones
• General sub-word units
• Size of the inventory: using only the frequent symbols is better
• Data-driven: units selected by a clustering tree

Automatic Speech Retrieval in an Unknown Language
FSM-based fuzzy matching and retrieval: the same pipeline as before (short-time feature extraction → spectral features → speech recognition → lattices → FSM indexing → FSM-based query construction → FSM-based retrieval), but both the speech and the queries are in an unknown language. What should they be represented as: words, triphones, phones, or other acoustic units?

Representing speech in an unknown language with existing models can be viewed as a special case of the cognitive process called assimilation; accommodation (introducing new models) remains an open question. Speech in the unknown language is represented with language-dependent (LD) phones (English phones, Spanish phones, Russian phones, …), which are grouped into language-independent (LID) phone clusters based on the pairwise KL divergence between the LD models, yielding LID phone lattices.

Multilingual Subword Units in an Unknown Language
• Language-dependent versions of the same IPA symbol can end up in one cluster (e.g., /z/ or /t∫/) or in different clusters (e.g., /j/ or /I/)
• Different IPA symbols may be similar
[Table: phone recognition accuracy and retrieval performance (with audio queries and with IPA sequence queries) for Croatian as the unknown language, comparing systems trained on Croatian with the language-independent clustered units]
Clustering gives a more compact phone set, and performs better.

Thank you.