CS 4705
• In general, information about:
– What the speaker is trying to convey
• Is this a statement or a question?
– The speaker state
• Is the speaker getting angry, frustrated?
• In dialogue, information about:
– The structure of the dialogue
• Is the user or the system trying to start a new topic?
• Is the speaker talking about given or new information?
– The state of the interaction:
• Is the user having trouble being understood?
• Is the user having trouble understanding the system?
• New description schemes (e.g. ToBI)
• Corpus-based research and machine learning
• Emphasis on evaluation of algorithms and systems (NLE ‘00 special issue)
• Investigation of spontaneous speech phenomena and variation in speaking style
• Applications to CTS, ASR and SDS
• Public and semi-public databases
– ATIS, SwitchBoard, Call Home, Meetings (NIST/DARPA/LDC)
– TRAINS/TRIPS (U. Rochester), FM Radio (BU), BDC (Harvard, AT&T)
• Private collections
– Acquired for speech or dialogue research (August, KTH; Voicemail, AT&T, IBM)
– Meetings, call centers, operator services, focus group collections
• The Web
– Newscasts, radio
• Developed by prosody researchers in four meetings over 1991-94
• Goals:
– devise a common labeling scheme for Standard American English that is robust and reliable
– promote collection of large, prosodically labeled, shareable corpora
• ToBI standards also proposed for Japanese, German, Italian, Spanish, British and Australian English, ...
• Minimal ToBI transcription:
– recording of speech
– f0 contour
– ToBI tiers:
• orthographic tier: words
• break-index tier: degrees of junction (Price et al. ‘89)
• tonal tier: pitch accents, phrase accents, boundary tones (Pierrehumbert ‘80)
• miscellaneous tier: disfluencies, non-speech sounds, etc.
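The minimal transcription above can be pictured as a small data structure. The following is only a sketch under assumptions: the class names, field names, file path, f0 values, and labels are all hypothetical and do not correspond to any official ToBI file format. It aligns every tier to words for simplicity, whereas real transcriptions are time-aligned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordLabel:
    """One word with its labels from the ToBI tiers (hypothetical layout)."""
    word: str                                        # orthographic tier
    break_index: int                                 # break-index tier: juncture after this word
    tones: List[str] = field(default_factory=list)   # tonal tier labels attached to this word
    misc: Optional[str] = None                       # miscellaneous tier, e.g. a disfluency note


@dataclass
class ToBITranscription:
    """Recording, f0 contour, and the labeled words of one utterance."""
    audio_path: str        # hypothetical path to the recording
    f0_hz: List[float]     # f0 contour samples (made-up values below)
    words: List[WordLabel]

    def accented_words(self) -> List[str]:
        """Words carrying a pitch accent, i.e. a tonal label with a starred tone."""
        return [w.word for w in self.words if any("*" in t for t in w.tones)]


example = ToBITranscription(
    audio_path="utt001.wav",
    f0_hz=[180.0, 210.0, 190.0, 150.0],
    words=[
        WordLabel("take", 1),
        WordLabel("the", 1),
        WordLabel("bus", 4, tones=["H*", "L-", "L%"]),
    ],
)
print(example.accented_words())
```

Keeping the tiers together per word makes it easy to ask questions that cut across tiers, such as which words are accented or which words end a phrase.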
• Online training material, available at:
– http://www.ling.ohio-state.edu/phonetics/ToBI/
• Evaluation
– Good inter-labeler reliability for expert and naive labelers: 88% agreement on presence/absence of tonal category, 81% agreement on category label, 91% agreement on break indices to within 1 level (Silverman et al. ‘92, Pitrelli et al. ‘94)
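Agreement figures of this kind can be computed by comparing every pair of labelers word by word. The sketch below is one plausible way to do that, not necessarily the exact procedure used in the cited studies; the function name and the break-index data are invented for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence


def pairwise_agreement(
    labelings: Sequence[Sequence[int]],
    agree: Callable[[int, int], bool] = lambda a, b: a == b,
) -> float:
    """Fraction of (labeler pair, word) comparisons on which the pair agrees.

    Each inner sequence holds one labeler's labels for the same words,
    so all sequences are assumed to have the same length.
    """
    agreeing = 0
    total = 0
    for first, second in combinations(labelings, 2):
        for a, b in zip(first, second):
            agreeing += int(agree(a, b))
            total += 1
    return agreeing / total if total else 0.0


# Hypothetical break indices from three labelers for the same four words.
break_indices = [
    [1, 1, 3, 4],
    [1, 2, 3, 4],
    [1, 1, 4, 4],
]

exact = pairwise_agreement(break_indices)
within_one = pairwise_agreement(break_indices, lambda a, b: abs(a - b) <= 1)
print(f"exact: {exact:.2f}, within one level: {within_one:.2f}")
```

Passing the agreement criterion as a function lets the same routine cover both exact category agreement and the looser "within 1 level" criterion used for break indices.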
• Which items are made intonationally prominent and how?
• Accent type:
– H* simple high (declarative)
– L* simple low (yes-no question)
– L*+H scooped, late rise (uncertainty/ incredulity)
– L+H* early rise to stress (contrastive focus)
– H+!H* fall onto stress (implied familiarity)
• Downstepped accents:
– !H*, L+!H*, L*+!H
• Degree of prominence:
– within a phrase: HiF0
– across phrases
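The tonal labels above follow a regular notation: a starred tone marks a pitch accent and "!" marks downstep, while (as in the phrasing and contour examples elsewhere in these slides) a trailing "-" marks a phrase accent and a trailing "%" a boundary tone. A minimal sketch of a label classifier, assuming each label is given separately (so "L-" and "H%" rather than the combined "L-H%"); the function name is hypothetical:

```python
from typing import Dict, Union


def classify_tone(label: str) -> Dict[str, Union[str, bool]]:
    """Classify a single tonal-tier label by its notation."""
    if label.endswith("%"):
        kind = "boundary tone"
    elif "*" in label:
        kind = "pitch accent"
    elif label.endswith("-"):
        kind = "phrase accent"
    else:
        kind = "unknown"
    return {"label": label, "kind": kind, "downstepped": "!" in label}


for tone in ["H*", "L+!H*", "L-", "H%"]:
    print(classify_tone(tone))
```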
• Given/new information
– S: Do you need a return ticket?
– U: No, thanks, I don’t need a return.
• Contrast (narrow focus)
– U: No, thanks, I don’t need a RETURN… (I need a time schedule, receipt, …)
• Disambiguation of discourse markers
– S: Now let me get you the train information.
– U: Okay (thanks) vs. Okay… (but I really want…)
• Applications: TTS and CTS
• Corpora: read and spontaneous speech
• Features: POS window of 3, sentence position, position within NP, number of syllables, position in complex nominal, inferred given/new status, inferred focus, mutual information
• Results: 75-85% correct, depending on genre
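A few of the features listed above can be read directly off a POS-tagged sentence. The sketch below covers only the POS window of 3, sentence position, and syllable count; it is not the feature extractor behind the reported results. The function names, the padding symbols, and the vowel-group syllable heuristic are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def count_syllables(word: str) -> int:
    """Crude estimate (an assumption): count groups of consecutive vowel letters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def accent_features(tagged: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Build one feature dict per word from (word, POS tag) pairs."""
    # Pad the tag sequence so the first and last words still get a full window.
    padded = ["<s>"] + [pos for _, pos in tagged] + ["</s>"]
    rows = []
    for i, (word, _pos) in enumerate(tagged):
        rows.append({
            "word": word,
            "pos_window": tuple(padded[i:i + 3]),    # previous, current, next tag
            "sentence_position": i,                  # distance from sentence start
            "distance_from_end": len(tagged) - 1 - i,
            "num_syllables": count_syllables(word),
        })
    return rows


sentence = [("I", "PRP"), ("need", "VBP"), ("a", "DT"), ("return", "NN")]
features = accent_features(sentence)
for row in features:
    print(row)
```

Feature dicts of this shape could then be passed to any classifier that predicts accented versus unaccented for each word.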
• ‘Levels’ of phrasing:
– intermediate phrase: one or more pitch accents plus a phrase accent (H- or L-)
– intonational phrase: one or more intermediate phrases plus a boundary tone (H% or L%)
• ToBI break-index tier
– 0 no word boundary
– 1 word boundary
– 2 strong juncture with no tonal markings
– 3 intermediate phrase boundary
– 4 intonational phrase boundary
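Because break indices 3 and 4 mark the two levels of phrasing, a labeled word sequence can be cut into phrases with a single threshold. A minimal sketch follows; the function name is hypothetical and the break indices assigned to the example utterance are invented, not taken from a labeled corpus.

```python
from typing import List, Tuple


def split_phrases(words: List[Tuple[str, int]], min_break: int) -> List[List[str]]:
    """Group words into phrases, closing a phrase after any word whose
    break index is at least min_break (3 = intermediate, 4 = intonational)."""
    phrases: List[List[str]] = []
    current: List[str] = []
    for word, break_index in words:
        current.append(word)
        if break_index >= min_break:
            phrases.append(current)
            current = []
    if current:  # trailing words with no closing boundary label
        phrases.append(current)
    return phrases


# (word, break index after the word); labels are hypothetical.
labeled = [("No", 3), ("thanks", 4), ("I", 1), ("don't", 1),
           ("need", 1), ("a", 1), ("return", 4)]

intermediate = split_phrases(labeled, 3)
intonational = split_phrases(labeled, 4)
print(intermediate)
print(intonational)
```

Since every intonational phrase boundary (4) also closes an intermediate phrase, the threshold of 3 always yields at least as many phrases as the threshold of 4.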
• Disambiguates syntactic constructions, e.g. PP attachment, restrictive/non-restrictive relative clause:
– S: You should buy the ticket with the discount coupon.
– S: The itinerary which I faxed includes deluxe accommodations.
• Disambiguates scope ambiguities, e.g. negation:
– S: You aren’t booked through Rome because of the fare.
• Or modifier scope:
– S: This fare is restricted to retired politicians and civil servants.
• Applications: TTS, CTS, ASR
• Corpora: AP news, Penn Treebank, ATIS
• Features: sentence position, sentence length, POS window of 4, location of previous predicted boundary, mutual information, constituent information, dependency structure
• Results: 96% correct
• What do intonational contours ‘mean’ (Ladd ‘80, Bolinger ‘89)?
– Speech acts (statements, questions, requests)
• S: That’ll be credit card? (L* H- H%)
– Propositional attitude (uncertainty, incredulity)
• S: You’d like an evening flight. (L*+H L- H%)
– Speaker affect (anger, happiness, love)
• U: I said four SEVEN one! (L+H* L- L%)
– “Personality”
• S: Welcome to the Sunshine Travel System.
• Level of speaker engagement
– S: Welcome to InfoTravel. How may I help you?
• Contour interpretation
– S: You can take the bus (L*+H) from Malpensa to Rome (L- H%).
– U: Take the bus. vs. Take the bus!
• Discourse/topic structure
– Topic beginnings have higher pitch range, faster, preceded by longer pauses
– Endings the opposite
• What makes an utterance sound angry? Sad?
– How much comes from the lexical information?
– How much from the acoustic/prosodic?
– Does all anger, e.g., sound the same?
• Cahn ‘88 (examples)
• Text-to-Speech and Concept-to-Speech generation: improve naturalness
• Speech Recognition: identify suprasegmental meaning
• Spoken Dialogue Systems: understand when people are confused, angry
• Audio Browsing: format corpora for browsing and search
• We don’t really know what most contours ‘mean’
• Our accent prediction needs better models of given/new status, focus, and grammatical function
• Our phrasing prediction needs better information about e.g. attachment
• We don’t know much about emotional speech or ‘personality’, both critical to applications