Towards Dolphin Recognition
Tanja Schultz, Alan Black, Bob Frederking
Carnegie Mellon University
West Palm Beach, March 28, 2003

Outline
- Speech-to-Speech Recognition
  - Brief introduction
  - Lab, research
- Data Requirements
  - Audio data
  - ‘Transcriptions’
- Towards Dolphin Recognition
  - Applications
  - Current approaches
  - Preliminary results

Part 1: Speech-to-Speech Recognition

Speech Processing Terms
- Speech Recognition: converts spoken input into written text output
- Natural Language Understanding (NLU): derives the meaning of spoken or written input
- (Speech-to-Speech) Translation: transforms text/speech of language A into text/speech of language B
- Speech Synthesis (Text-to-Speech = TTS): converts written text input into audible output

Speech Recognition
[Pipeline diagram: speech input -> preprocessing -> decoding/search -> postprocessing -> synthesis (TTS), with competing hypotheses such as "Hello", "Hale Bob", "Hallo"]

Fundamental Equation of SR
P(W | X) = P(X | W) * P(W) / P(X)
- P(X | W): acoustic model (sound units, e.g. begin/middle/end states A-b, A-m, A-e)
- Pronunciation dictionary: maps words to sound units (Am = AE M, Are = AR, I = AI, you = JU, we = VE)
- P(W): language model (word sequences such as "I am", "you are", "we are")
(The decision rule this equation implies is expanded in the derivation sketch below, just after the task-overview slide.)

SR: Data Requirements
- Audio data -> acoustic model: sound set, units built from sounds
- Pronunciation dictionary: words mapped to sound units
- Text data -> language model

Janus Speech Recognition Toolkit (JRTk)
- Unlimited and open vocabulary
- Spontaneous and conversational human-human speech
- Speaker-independent
- High bandwidth, telephone, car, broadcast
- Languages: English, German, Spanish, French, Italian, Swedish, Portuguese, Korean, Japanese, Serbo-Croatian, Chinese, Shanghainese, Arabic, Turkish, Russian, Tamil, Czech
- Best performance on public benchmarks:
  - DoD (English), DARPA Hub-5 tests ’96, ’97 (Switchboard task)
  - Verbmobil (German) benchmarks ’95-’00 (travel task)

Mobile Device for Translation & Navigation

Multilingual Meeting Support
- The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting, or search a set of existing meetings for a particular speaker, topic, or idea.

Multilingual Indexing of Video
- View4You / Informedia: automatically records broadcast news and lets the user retrieve video segments of news items on different topics via spoken-language input
- Non-cooperative speaker on video
- Cooperative user
- Indexing requires only low-quality translation

Part 2: Towards Dolphin Recognition

Towards Dolphin Recognition
The classic speaker-recognition tasks, carried over to dolphins:
- Identification: Whose voice is this? (Is it Nippy?)
- Verification/Detection: Is this Nippy’s voice?
- Segmentation and Clustering: Where are the speaker (dolphin) changes, speaker A vs. speaker B? Which segments are from the same speaker (dolphin)?
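The fundamental-equation slide above states Bayes’ rule but leaves the recognizer’s decision rule implicit. As a small added derivation (standard Bayes decomposition, not from the original slides): since P(X) does not depend on the word sequence W, it can be dropped from the maximization, leaving exactly the two knowledge sources the slide names.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \cdot
                       \underbrace{P(W)}_{\text{language model}}
```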
Applications: ‘Off-line’ (off the water, off the boat, off season)
- Data management and indexing
  - Automatic assignment/labeling of already recorded (archived) data
  - Automatic post-processing (indexing) for later retrieval
- Towards important/meaningful units = DOLPHONES
  - Segmentation and clustering of similar sounds/units
  - Find out about unit frequencies
  - Find out about correlations between sounds and other events
- Whistles correlated with family relationships
  - Who belongs to whom?
  - Can we find out about the family tree?
  - Can we find out more about social structure?

Applications: ‘On-line’
- Identification and tracking
  - Who is currently vocalizing?
  - Who is around?
- Towards important/meaningful units
  - Find out about correlations between sounds and other events
- Whistles correlated with family relationships
  - Who belongs to whom?
- Wide-range identification, tracking, and observation (sound travels much farther in water than images do)

Common Approaches
Two distinct phases:
- Training phase: training speech for each dolphin -> feature extraction -> model training -> one model per dolphin (Nippy, Havana, ...)
- Detection phase: unknown recording -> feature extraction -> detection decision -> hypothesis (e.g. "Havana")

Current Approaches
- A likelihood-ratio test is used for the detection decision:
  L = p(X | dolph) / p(X | non-dolph); accept if L >= theta, reject if L < theta
- p(X | dolph) is the likelihood of the dolphin model given the features X = (x1, x2, ...)
- p(X | non-dolph) is an alternative, so-called background model, trained on all data except that of the dolphin in question
(A toy implementation sketch of this test follows the Next Steps slide below.)

First Experiments: Setup
- Take the data we got from Denise
- Alan labeled about 160 files:
  - dolphin sounds: ~370 tokens
  - electric noise (machine, clicks, others): ~180 tokens
  - pauses: ~220 tokens
- Derive the dolphin ID from the file name (educated guess): Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy
- Train one model per dolphin plus one ‘garbage’ model for everything else
- Recognize each incoming audio file; the hypothesis is a sequence of dolphin and garbage models
- Count the models per audio file and return the dolphin with the highest count as the one identified (see the voting sketch at the end of this section)

First Experiments: Results
[Bar chart: identification accuracy [%] per dolphin (Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy) for models with 10, 20, and 100 Gaussians]

Next Steps
- Step 1: To build a ‘real’ system we need
  - MORE audio data, MORE audio data, MORE ...
  - Labels (the more accurate, the better)
    - Idea 1: automatic labeling, live with the errors
    - Idea 2: manual labeling
    - Idea 3: automatic labeling with manual post-editing
- Step 2: Given more data
  - Automatic clustering
  - First steps towards unit detection
- Step 3: Build a working system; make it small and fast enough for deployment
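The Current Approaches slide describes the likelihood-ratio test but not its implementation. Here is a minimal sketch, assuming Gaussian mixture models over frame-level features and scikit-learn; the feature extraction (e.g. MFCC frames), mixture sizes, and threshold are assumptions, not taken from the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(dolphin_frames, background_frames, n_components=20):
    """Fit one GMM on the target dolphin's frames and one 'background'
    GMM on all other data, as on the Current Approaches slide."""
    dolph = GaussianMixture(n_components=n_components, random_state=0).fit(dolphin_frames)
    backg = GaussianMixture(n_components=n_components, random_state=0).fit(background_frames)
    return dolph, backg

def detect(dolph, backg, frames, threshold=0.0):
    """Average per-frame log-likelihood ratio:
    log L = mean_t [ log p(x_t | dolph) - log p(x_t | non-dolph) ].
    Accept (it is the dolphin) if log L clears the threshold."""
    # GaussianMixture.score returns the mean log-likelihood per sample.
    log_ratio = dolph.score(frames) - backg.score(frames)
    return log_ratio >= threshold, log_ratio

# Toy usage with synthetic 13-dimensional "feature frames":
rng = np.random.default_rng(0)
dolph, backg = train_detector(rng.normal(0, 1, (500, 13)),
                              rng.normal(3, 1, (500, 13)))
accepted, score = detect(dolph, backg, rng.normal(0, 1, (100, 13)))
print(accepted, round(score, 2))
```

Working in the log domain avoids numerical underflow over long frame sequences, and the threshold theta trades false accepts against false rejects.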
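The First Experiments setup says the hypothesis for a file is a sequence of dolphin and garbage models, and the most frequent dolphin model wins. Below is a rough stand-in for that voting step, assuming frame-level GMM scoring rather than the actual Janus decoder, so it approximates the described setup and is not the original system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def identify(models: dict[str, GaussianMixture],
             garbage: GaussianMixture,
             frames: np.ndarray) -> str:
    """Assign every frame to its best-scoring model, then return the
    dolphin with the highest count; garbage frames carry no vote."""
    names = list(models)
    # score_samples: per-frame log-likelihood under each model
    ll = np.stack([models[n].score_samples(frames) for n in names]
                  + [garbage.score_samples(frames)])
    winners = np.argmax(ll, axis=0)            # best model per frame
    counts = np.bincount(winners, minlength=len(names) + 1)
    return names[int(np.argmax(counts[:-1]))]  # drop the garbage bin
```

In the reported setup this voting ran over ten dolphin models (Caroh through Nippy) plus one garbage model, with mixtures of 10, 20, and 100 Gaussians compared in the results chart.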