Towards Dolphin Recognition
Tanja Schultz, Alan Black, Bob Frederking
Carnegie Mellon University
West Palm Beach, March 28, 2003
Outline
• Speech-to-Speech Recognition
  • Brief Introduction: Lab, Research
  • Data Requirements: Audio data, ‘Transcriptions’
• Towards Dolphin Recognition
  • Applications
  • Current Approaches
  • Preliminary Results
Part 1: Speech-to-Speech Recognition
• Brief Introduction: Lab, Research
• Data Requirements: Audio data, ‘Transcriptions’
Speech Processing Terms
• Speech Recognition
  Converts spoken input into written text output
• Natural Language Understanding (NLU)
  Derives the meaning of the spoken or written input
• (Speech-to-Speech) Translation
  Transforms text/speech of language A into text/speech of language B
• Speech Synthesis (Text-To-Speech = TTS)
  Converts written text input into audible output
Speech Recognition
[Diagram: speech input → preprocessing → decoding/search over "h-e-l-l-o" → postprocessing → synthesis (TTS); candidate hypotheses: "Hello", "Hale Bob", "Hallo", ...]
Fundamental Equation of SR

P(W|X) = [ P(X|W) · P(W) ] / P(X)

[Diagram: the three knowledge sources behind the equation: acoustic model (sub-phone units A-b, A-m, A-e), pronunciation dictionary (am = AE M, are = AR, I = AI, you = JU, we = VE), and language model ("I am", "you are", "we are", ...)]
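Since P(X) is constant over candidate word sequences W, decoding reduces to maximizing P(X|W) · P(W). A minimal sketch of that decision rule in Python, with invented scores standing in for real acoustic and language models:

import math

# Hypothetical (words, log P(X|W), log P(W)) triples; the numbers are
# invented for illustration only.
hypotheses = [
    ("hello",    -42.0, math.log(0.01)),
    ("hale bob", -40.5, math.log(0.0001)),
    ("hallo",    -43.0, math.log(0.002)),
]

# Fundamental equation in log form: argmax over W of log P(X|W) + log P(W).
best = max(hypotheses, key=lambda h: h[1] + h[2])
print(best[0])  # -> hello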
SR: Data Requirements
• Audio data → acoustic model: the sound set, plus units built from sounds (A-b, A-m, A-e, ...)
• Pronunciation dictionary: am = AE M, are = AR, I = AI, you = JU, we = VE
• Text data → language model: "I am", "you are", "we are", ...
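As a toy illustration of the three data requirements (contents invented beyond the examples on the slide):

# Acoustic model data: audio labeled with units from the sound set.
sound_set = ["A-b", "A-m", "A-e", "AE", "M", "AR", "AI", "JU", "VE"]

# Pronunciation dictionary: words mapped to unit sequences.
pronunciations = {
    "am": ["AE", "M"], "are": ["AR"],
    "I": ["AI"], "you": ["JU"], "we": ["VE"],
}

# Language model data: text from which word-sequence statistics are estimated.
lm_text = ["I am", "you are", "we are"]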
Janus Speech Recognition Toolkit (JRTk)
• Unlimited and open vocabulary
• Spontaneous and conversational human-human speech
• Speaker-independent
• High bandwidth, telephone, car, broadcast
• Languages: English, German, Spanish, French, Italian, Swedish, Portuguese, Korean, Japanese, Serbo-Croatian, Chinese, Shanghai, Arabic, Turkish, Russian, Tamil, Czech
• Best performance on public benchmarks:
  • DoD; (English) DARPA Hub-5 tests ‘96, ‘97 (SWB task)
  • Verbmobil (German) benchmarks ’95-’00 (travel task)
Mobile Device for Translation & Navigation

Multilingual Meeting Support
The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting, or search a set of existing meetings for a particular speaker, topic, or idea.

Multilingual Indexing of Video
• View4You / Informedia: automatically records broadcast news and allows the user to retrieve video segments of news items on different topics using spoken-language input
• Non-cooperative speaker on video
• Cooperative user
• Indexing requires only low-quality translation
Part 2: Towards Dolphin Recognition
• Applications
• Current Approaches
• Preliminary Results
Towards Dolphin Recognition
• Identification: Whose voice is it?
• Verification/Detection: Is it Nippy’s voice?
Segmentation and Clustering
• Where are dolphins changing? (Speaker A vs. Speaker B; see the sketch below)
• Which segments are from the same dolphin?
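A toy sketch of the change-detection half of this task, assuming feature frames have been extracted from the audio; the window size and threshold are invented, and real systems use statistical criteria such as the Bayesian Information Criterion (BIC) rather than a plain mean distance:

import numpy as np

# Flag a candidate "dolphin change" wherever the mean feature vector of
# adjacent windows shifts sharply.
def change_points(frames, win=50, threshold=2.0):
    points = []
    for t in range(win, len(frames) - win):
        left = frames[t - win:t].mean(axis=0)
        right = frames[t:t + win].mean(axis=0)
        if np.linalg.norm(left - right) > threshold:
            points.append(t)   # candidate change at frame t
    return points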
Applications
‘Off-line’ applications (off the water, off the boat, off season):
• Data management and indexing
  • Automatic assignment/labeling of already recorded (archived) data
  • Automatic post-processing (indexing) for later retrieval
• Towards important/meaningful units = DOLPHONES
  • Segmentation and clustering of similar sounds/units
  • Find out about unit frequencies
  • Find out about correlations between sounds and other events
• Whistles correlated to family relationships
  • Who belongs to whom
  • Find out about the family tree?
  • Can we find out more about social structure?
Applications
‘On-line’ applications:
• Identification and tracking
  • Who is currently speaking
  • Who is around
• Towards important/meaningful units
  • Find out about correlations between sounds and other events
• Whistles correlated to family relationships
  • Who belongs to whom
• Wide-range identification, tracking, and observation (since sound travels over longer distances than images)
Common Approaches
Two distinct phases:
• Training phase: training speech for each dolphin (Nippy, Havana, ...) → feature extraction → model training → one model per dolphin
• Detection phase: unknown audio → feature extraction → detection decision → hypothesis (e.g., Havana)
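A minimal sketch of the two phases, assuming feature vectors (e.g., cepstral frames) have already been extracted, and using Gaussian mixture models via scikit-learn; all names and shapes are illustrative, not from the original system:

import numpy as np
from sklearn.mixture import GaussianMixture

# Training phase: fit one GMM per dolphin on its feature frames.
def train_models(features_per_dolphin, n_components=10):
    models = {}
    for name, feats in features_per_dolphin.items():   # feats: (frames, dims)
        models[name] = GaussianMixture(n_components=n_components).fit(feats)
    return models

# Detection phase: score unknown features against every model and
# hypothesize the best-scoring dolphin.
def identify(models, feats):
    return max(models, key=lambda name: models[name].score_samples(feats).sum())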
Current Approaches
• A likelihood ratio test is used for the detection decision:

  L = p(X|dolph) / p(X|non-dolph)

  [Diagram: feature extraction feeds both the dolphin model and the background model; their likelihoods are combined into the ratio L]
  • L ≥ θ → accept
  • L < θ → reject
• p(X|dolph) is the likelihood of the dolphin model given the features X = (x1, x2, ...)
• p(X|non-dolph) is an alternative, so-called background model trained on all data except that of the dolphin in question
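In log form the test becomes log p(X|dolph) − log p(X|non-dolph) ≥ log θ. A sketch, reusing the hypothetical GMMs from train_models above; the threshold is application-dependent and invented here:

# Likelihood-ratio detection with two trained GMMs.
def detect(dolphin_model, background_model, feats, log_theta=0.0):
    log_ratio = (dolphin_model.score_samples(feats).sum()
                 - background_model.score_samples(feats).sum())
    return "accept" if log_ratio >= log_theta else "reject"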
First Experiments - Setup
• Take the data we got from Denise
  • Alan did the labeling of about 160 files
  • Labels:
    • dolphin sounds: ~370 tokens
    • electric noise (machine, clicks, others): ~180 tokens
    • pauses: ~220 tokens
• Derive the dolphin ID from the file name (educated guess)
  (Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy)
• Train one model per dolphin and one ‘garbage’ model for the rest
• Recognize each incoming audio file; hypotheses consist of a list of dolphin and garbage models
• Count the number of models per audio file and return the name of the dolphin with the highest count as the one identified (see the sketch below)
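A sketch of that file-level decision rule; the label sequence is hypothetical:

from collections import Counter

# The recognizer emits one label (a dolphin name or 'garbage') per
# hypothesized model in the file; the most frequent dolphin wins.
def identify_file(hypothesis_labels):
    counts = Counter(l for l in hypothesis_labels if l != "garbage")
    return counts.most_common(1)[0][0] if counts else None

print(identify_file(["Nippy", "garbage", "Nippy", "Havana"]))  # -> Nippy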
First Experiments - Results
[Bar chart: dolphin ID accuracy [%] (0-100) per dolphin (Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy), comparing models with 10, 20, and 100 Gaussians]
Next steps
• Step 1: To build a ‘real’ system we need
  • MORE audio data, MORE audio data, MORE ...
  • Labels (the more accurate the better)
    • Idea 1: automatic labeling, live with the errors
    • Idea 2: manual labeling
    • Idea 3: automatic labeling and post-editing
• Step 2: Given more data
  • Automatic clustering (see the sketch below)
  • Try first steps towards unit detection
• Step 3: Build a working system; make it small and fast enough for deployment
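A toy sketch of Step 2’s automatic clustering, assuming fixed-length feature vectors cut from the recordings; the number of candidate units k is a guess to be tuned, and nothing here comes from the actual experiments:

import numpy as np
from sklearn.cluster import KMeans

def cluster_units(segment_features, k=32):
    # Group similar sound segments into k candidate 'dolphone' units.
    kmeans = KMeans(n_clusters=k, n_init=10).fit(segment_features)
    return kmeans.labels_                          # one unit id per segment

labels = cluster_units(np.random.randn(500, 13))   # 500 dummy segments
print(np.bincount(labels))                         # unit frequencies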