Goals and Objectives

Automatic Phonetic and Prosodic
Annotation of Spoken Language
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng
steveng@icsi.berkeley.edu
In Collaboration with Shawn Chang (ICSI) and Mirjam Wester (Nijmegen)
Acknowledgements and Thanks
Automatic Feature Classification and Analysis: Joy Hollenback, Lokendra Shastri, Rosaria Silipo
Research Funding: U.S. National Science Foundation, U.S. Department of Defense
Motivation for Automatic Transcription
• Many Properties of Spontaneous Spoken Language Differ from Those of Laboratory and Citation Speech
– There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization
• Phonetic and Prosodic Annotation Material is of Limited Quantity
– Phonetic and prosodic material is important for understanding spoken language and for developing superior recognition and synthesis technology
• Manual Annotation of Phonetic and Prosodic Material is a Pain in the Butt to Produce
– Hand labeling and segmentation is time-consuming and expensive
– It is difficult to find qualified transcribers, and training can be arduous
• Automatic Alignment Systems (used in speech recognition) are Inaccurate in Terms of both Labeling and Segmentation
– Forced-alignment-based segmentation is poor – ca. 40% of phone boundaries are misplaced
– Phone classification error is ca. 30-50%
– Speech recognition systems do not currently deal with prosody
• Automatic Transcription is Likely to Aid in the Development of Speech Recognition and Synthesis Technology
– And is therefore worth the effort to develop
Road Map of the Presentation
• Introduction
– Motivation for developing automatic phonetic transcription systems
– Rationale for the current focus on articulatory-acoustic features (AFs)
– The development corpus – NTIMIT
– Justification for using NTIMIT for development of AF classifiers
• The ELITIST Approach and Its Application to English
– The baseline system
– The ELITIST approach
– Manner-specific classification for place of articulation features
• Application of the ELITIST Approach to Dutch
– The training and testing corpus – VIOS
– The nature of cross-linguistic transfer of articulatory-acoustic features
– The ELITIST approach to frame selection as applied to the VIOS corpus
– Improvement of place-of-articulation classification using manner-specific training in Dutch
• Stress Accent Annotation
• Conclusions and the Future
Part One
INTRODUCTION
Motivation for Developing Automatic Phonetic Transcription Systems
Rationale for the Current Focus on Articulatory-Acoustic Features
Description of the Development Corpus – NTIMIT
Justification for Using the NTIMIT Corpus
Corpus Generation – Objectives
• Provides Detailed, Empirical Material for the Study of Spoken Language
– Such data provide an important basis for scientific insight and understanding
– Facilitates development of new models of spoken language
• Provides Training Material for Technology Applications
– Automatic speech recognition, particularly pronunciation models
– Speech synthesis, ditto
– Cross-linguistic transfer of technology algorithms
• Promotes Development of NOVEL Algorithms for Speech Technology
– Pronunciation models and lexical representations for automatic speech recognition and speech synthesis
– Multi-tier representations of spoken language
Corpus-Centric View of Spoken Language
Our Focus in Today’s Presentation is on Articulatory Feature Classification
Other levels of linguistic representation are also extremely important to annotate
Rationale for Articulatory-Acoustic Features
• Articulatory-Acoustic Features (AFs) are the “Building Blocks” of the Lowest (i.e., Phonetic) Tier of Spoken Language
– AFs can be combined in a multitude of ways to specify virtually any speech sound found in the world’s languages
– AFs are therefore more appropriate for cross-linguistic transfer than phonetic segments
• AFs are Systematically Organized at the Level of the Syllable
– Syllables are a basic articulatory unit in speech
– The pronunciation patterns observed in casual conversation are systematic at the AF level, but not at the phonetic-segment level, and can therefore be used to develop more accurate and flexible pronunciation models than phonetic segments
• AFs are Potentially More Effective in Speech Recognition Systems
– More accurate and flexible pronunciation models (tied to syllabic and lexical units)
– AFs are generally more robust under acoustic interference than phonetic segments
– The relatively few alternative features within each AF dimension make classification inherently more robust than for phonetic segments
• AFs are Potentially More Effective in Speech Synthesis Systems
– More accurate and flexible pronunciation models (tied to syllabic and lexical units)
Primary Development Corpus – NTIMIT
• Sentences Read by Native Speakers of American English
– Quasi-phonetically balanced set of materials
– Wide range of dialect variability, both genders, variation in speaker age
– Relatively low semantic predictability
“She washed his dark suit in greasy wash water all year”
• Corpus Manually Labeled and Segmented at the Phonetic-Segment Level
– The precision of the phonetic annotation makes this an excellent training corpus
– The corpus was annotated at MIT
• A Large Amount of Annotated Material
– Over 2.5 hours of material used for training the classifiers
– 20 minutes of material used for testing
• Relatively Canonical Pronunciation Ideal for Training AF Classifiers
– Formal pronunciation patterns provide a means of deriving articulatory features from phonetic-segment labels via mapping rules
• NTIMIT is a Telephone Pass-band Version of the TIMIT Corpus
– Sentential material passed through a channel between 0.3 and 3.4 kHz
– Provides the capability of transfer to other telephone corpora (such as VIOS)
Part Two
THE ELITIST APPROACH
The Baseline System for Articulatory-Acoustic Feature Classification
The ELITIST Approach to Systematic Frame Selection for AF Classification
Improving Place-of-Articulation Classification Using Manner-Specific Training
The Baseline System for AF Classification
• Spectro-Temporal Representation of the Speech Signal
– Derived from a logarithmically compressed, critical-band energy pattern
– 25-ms analysis windows (i.e., a frame)
– 10-ms frame-sampling interval (i.e., 60% overlap between adjacent frames)
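As a rough illustration of this front end, the sketch below frames a signal into 25-ms windows at a 10-ms hop and computes log-compressed band energies over the telephone pass-band. The linearly spaced band edges are an assumption for illustration only; the actual system uses a critical-band filterbank.

```python
# Minimal sketch of the front end: log-compressed band energies,
# 25-ms windows, 10-ms hop. Band edges here are illustrative.
import numpy as np

def logspectral_frames(signal, sr=8000, win_ms=25, hop_ms=10, n_bands=14):
    win = int(sr * win_ms / 1000)          # 25-ms analysis window
    hop = int(sr * hop_ms / 1000)          # 10-ms frame-sampling interval
    n_fft = 1 << (win - 1).bit_length()    # next power of two
    edges = np.linspace(300, 3400, n_bands + 1)   # telephone pass-band
    bins = np.floor(edges / sr * n_fft).astype(int)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = signal[start:start + win] * np.hamming(win)
        power = np.abs(np.fft.rfft(seg, n_fft)) ** 2
        # Integrate power within each band, then log-compress
        band_e = [power[bins[b]:bins[b + 1]].sum() for b in range(n_bands)]
        frames.append(np.log(np.asarray(band_e) + 1e-10))
    return np.array(frames)                # shape: (n_frames, n_bands)
```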
Place of Articulation – A Brief Primer
The tongue contacts (or nearly so) the roof of the mouth in producing many of the consonantal sounds in English
Anterior – Labial: [p] [b] [m]; Labio-dental: [f] [v]; Inter-dental: [th] [dh]
Central – Alveolar: [t] [d] [n] [s] [z]
Posterior – Palatal: [sh] [zh]; Velar: [k] [g] [ng]
Chameleon – Rhoticized: [r]; Lateral: [l]; Approximant: [hh]
From Daniloff (1973)
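The table above corresponds to the kind of fixed phone-to-feature mapping used to derive AF training targets from phonetic labels. A minimal sketch for the place dimension (consonants only; the full mapping covers every phone on every AF dimension):

```python
# Phone-to-place mapping derived from the primer table (a sketch).
PLACE = {
    # Anterior
    "p": "anterior", "b": "anterior", "m": "anterior",      # labial
    "f": "anterior", "v": "anterior",                       # labio-dental
    "th": "anterior", "dh": "anterior",                     # inter-dental
    # Central
    "t": "central", "d": "central", "n": "central",
    "s": "central", "z": "central",                         # alveolar
    # Posterior
    "sh": "posterior", "zh": "posterior",                   # palatal
    "k": "posterior", "g": "posterior", "ng": "posterior",  # velar
    # "Chameleon" segments take their place value from context
    "r": "chameleon", "l": "chameleon", "hh": "chameleon",
}
```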
The Baseline System for AF Classification
• Multilayer Perceptron (MLP) Neural Network Classifiers
– Single hidden layer of 200-400 units, trained with back-propagation
– Nine frames of context used in the input
• An MLP Network for Each Articulatory Feature (AF) Dimension
– A separate network trained on voicing, place and manner of articulation, etc.
– Training targets were derived from hand-labeled phonetic transcripts and a fixed phone-to-AF mapping
– “Silence” was included as a feature in the classification of each AF dimension
– All of the results reported are for FRAME accuracy (not segmental accuracy)
• Focus on Articulatory Feature Classification Rather than Phone Identity
– Provides a more accurate means of assessing the MLP-based classification system
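A minimal sketch of the per-dimension training described above, using scikit-learn's MLP as a stand-in for the original back-propagation trainer; the helper names and the 300-unit hidden layer are illustrative assumptions.

```python
# One MLP per AF dimension, 9-frame context windows, targets from a
# fixed phone-to-AF map ("silence" included in every dimension).
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(frames, n_ctx=9):
    """Concatenate each frame with its neighbors (9 frames of context)."""
    half = n_ctx // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    return np.array([padded[i:i + n_ctx].ravel() for i in range(len(frames))])

def train_af_nets(frames, phone_per_frame, af_maps):
    """af_maps: dict of dimension name -> {phone: feature} mappings."""
    X = stack_context(frames)
    nets = {}
    for dim, phone2feat in af_maps.items():
        y = [phone2feat.get(ph, "silence") for ph in phone_per_frame]
        net = MLPClassifier(hidden_layer_sizes=(300,), max_iter=200)
        nets[dim] = net.fit(X, y)
    return nets
```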
Baseline System Performance Summary
• Classification of Articulatory Features Exceeds 80% – Except for Place
• Objective – Improve Classification across All AF Dimensions, Particularly Place of Articulation
NTIMIT Corpus
Not All Frames are Created Equal
• Correlation Between Frame Position and Classification Accuracy for MANNER of Articulation Features:
– The 20% of the frames closest to the segment BOUNDARIES are 73% correct
– The 20% of the frames closest to the segment CENTER are 90% correct
• Correlation Between Frame Position within a Segment and Classifier Output for MANNER Features:
– The 20% of the frames closest to the segment BOUNDARIES have a mean maximum output (“confidence”) level of 0.797
– The 20% of the frames closest to the segment CENTER have a mean maximum output (“confidence”) level of 0.892
– This dynamic range of 0.1 (in absolute terms) is HIGHLY significant
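The boundary-versus-center comparison above can be computed as sketched below, assuming `correct` marks whether each frame's manner label matched the reference; the function name is hypothetical.

```python
# Compare accuracy for the 20% of frames nearest segment boundaries
# vs. the 20% nearest the segment center (a sketch).
import numpy as np

def boundary_vs_center_accuracy(correct, segment_bounds):
    """correct: boolean array, one entry per frame.
    segment_bounds: list of (start, end) frame indices per segment."""
    edge_hits, ctr_hits = [], []
    for start, end in segment_bounds:
        n = end - start
        k = max(1, n // 10)            # 10% from each edge = 20% total
        idx = np.arange(start, end)
        edge_hits.extend(correct[idx[:k]])
        edge_hits.extend(correct[idx[-k:]])
        mid = start + n // 2
        ctr_hits.extend(correct[max(start, mid - k):min(end, mid + k)])
    return np.mean(edge_hits), np.mean(ctr_hits)
```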
Not All Frames are Created Equal
• Manner Classification is Best for Frames in the Phonetic-Segment Center
• MLP Network Confidence Level is Highly Correlated with Frame Accuracy
• The Most Confidently Classified Frames are Generally More Accurate
Selecting a Threshold for Frame Selection
• The Correlation Between Neural Network Confidence Level and Frame Position within the Phonetic Segment Can Be Exploited to Enhance Articulatory Feature Classification
– This insight provides the basis for the “ELITIST” approach
Selecting a Threshold for Frame Selection
• The Most Confidently Classified Frames are Generally More Accurate
• Frames with Confidence Levels Below “Threshold” are Discarded
– Setting the threshold to 0.7 filters out ca. 20% of the frames
– Boundary frames are twice as likely to be discarded as central frames
• Primary Drawback of Using This Threshold for Frame Selection
– 6% of the phonetic segments have most of their frames discarded
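The selection rule itself is simple; a minimal sketch, assuming `posteriors` holds the per-frame MLP outputs:

```python
# The ELITIST selection rule in miniature: keep only frames whose
# maximum MLP output ("confidence") clears the threshold.
import numpy as np

def elitist_select(posteriors, threshold=0.7):
    """posteriors: (n_frames, n_classes) MLP outputs.
    Returns indices of the frames retained for downstream use."""
    confidence = posteriors.max(axis=1)
    return np.where(confidence >= threshold)[0]

# With the threshold at 0.7, roughly 20% of NTIMIT frames --
# disproportionately boundary frames -- fall below it and are discarded.
```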
The ELITIST Approach to Manner Classification
• The Accuracy of MANNER Frame Classification Improves
– Frame-level classification accuracy increases overall from 85% to 93%
• Certain Manner Classes Improve Markedly with Frame Selection
– Nasals, stops, fricatives and flaps all show strong improvement in performance
Manner-Dependency for Place of Articulation
• Objective – Reduce the Number of Place Features to Classify for Any Single Manner Class
– Although there are NINE distinct place-of-articulation features overall ...
– For any single manner class there are only three or four place features
– The specific PLACES of articulation for stops differ from those for fricatives, etc.
– HOWEVER, the SPATIAL PATTERNING of the constriction loci is SIMILAR
• Because Classification Accuracy for Manner Features is High, Manner-Specific Training for Place of Articulation is Feasible (as we’ll show)
Manner-Specific Place Classification
• Thus, Each Manner Class can be Trained on Comparable Relational Place Features: ANTERIOR – CENTRAL – POSTERIOR
• Classifying Place of Articulation in Manner-Specific Fashion Can Improve the Classification Accuracy of this Feature Dimension
– The training material is far more homogeneous under this regime and is thus more reliable and robust
NTIMIT (telephone) Corpus
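A sketch of the resulting two-stage classification, with one place network per manner class; the `manner_net` and `place_nets` names are hypothetical stand-ins for the trained MLPs.

```python
# Manner-specific place classification: classify manner first, then
# route the frame to a place classifier trained only on that manner
# class, with place collapsed to ANTERIOR / CENTRAL / POSTERIOR.
def classify_place(frame, manner_net, place_nets):
    manner = manner_net.predict([frame])[0]        # e.g. "stop", "fricative"
    if manner == "silence":
        return "silence"
    # Each place net sees only 3-4 targets, so its training material
    # is far more homogeneous than a single 9-way place classifier.
    return place_nets[manner].predict([frame])[0]  # "anterior"/"central"/"posterior"
```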
Manner-Specific Classification – Vowels
• Knowing the “Manner” Improves “Place” Classification for Vowels as Well
• It Also Improves “Height” Classification
NTIMIT (telephone) Corpus
Manner-Specific Place Classification – Overall
• Overall, Performance Improves Between 5% and 14% (in absolute terms)
• Improvement is Greatest for Stops, Nasals and Flaps
NTIMIT (telephone) Corpus
Manner Transcription of Spontaneous Speech
Automatic segmentation and labeling of articulatory manner was used as a
guide for phonetic labeling and segmentation of the Switchboard corpus
The manner segmentation is relatively isomorphic with phone segmentation
Summary – ELITIST Approach
• A Principled Method of Frame Selection (the ELITIST Approach) can be Used to Improve the Accuracy of Articulatory Feature Classification
• The ELITIST Approach is Based on the Observation that Frames in the Center of Phonetic Segments are More Accurately Classified than Those at Segment Boundaries
• Frame Classification Accuracy is Highly Correlated with MLP Network Confidence Level, which can therefore be Used to Systematically Discard Frames
• Discarding such Low-Confidence Frames Improves AF Classification
• Manner Classification Improves Sufficiently to Make Manner-Specific Training for Place-of-Articulation Features Feasible
• Place-of-Articulation Feature Classification Improves using Manner-Specific Training
• This Performance Enhancement is Probably the Result of:
– Fewer features to classify for any given manner class
– More homogeneous place-of-articulation training material
• Such Improvements in AF Classification Accuracy Can Be Used to Improve the Quality of Automatic Phonetic Annotation
Part Three
THE ELITIST APPROACH
GOES DUTCH
Description of the Development Corpus - VIOS
The Nature of Cross-Linguistic Transfer of Articulatory Features
Application of the ELITIST Approach to Dutch
Manner-Specific, Place-of-Articulation Classification for Dutch
Dutch Development Corpus – VIOS
• Extemporaneous, Prompted Human-Machine Telephone Dialogues
– Human speakers querying an automatic system for Dutch Railway timetables
– Wide range of dialect variability, both genders, variation in speaker age
• A Portion of the Corpus Manually Labeled at the Phonetic-Segment Level
– Material labeled by speech science students at Nijmegen University
– This component of the corpus served as the testing material
– There was 18 minutes of material in this portion of the corpus
• The Major Portion of the Corpus Automatically Labeled and Segmented
– The automatic method incorporated a certain degree of pronunciation-model knowledge derived from language-specific phonological rules
– This part of the corpus served as the training material
– There was 60 minutes of material in this portion of the corpus
How Dutch Differs from English
• Dutch and English are Genetically Closely Related Languages
– Perhaps 1500 years of time depth separate the languages
– They share some (but not all – see below) phonetic properties
• The “Dental” Place of Articulation is Present in English, but not in Dutch
• The Manner “Flap” is Present in English, but not in Dutch
• Certain Manner/Place Combinations in Dutch are not Found in English
– For example, the velar fricative associated with orthographic “g”
• The Vocalic System (particularly diphthongs) Differs Between Dutch and English
Cross-Linguistic Classification
• Classification Accuracy on the VIOS Corpus
– Results depend on whether the classifiers were trained on VIOS (Dutch) or NTIMIT (English) material
– Voicing and manner classification is comparable between the two training corpora
– Place classification is significantly worse when training on NTIMIT
– Other feature dimensions exhibit only slightly worse performance when training on NTIMIT
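A sketch of how such a comparison can be scored, assuming classifiers trained on each corpus and a common VIOS test set; the function names are illustrative, not the original tooling.

```python
# Score VIOS test frames with VIOS-trained vs. NTIMIT-trained nets.
def frame_accuracy(net, frames, targets):
    pred = net.predict(frames)
    return sum(p == t for p, t in zip(pred, targets)) / len(targets)

def compare_training_corpora(nets_vios, nets_ntimit, vios_test):
    frames, targets_by_dim = vios_test
    for dim in nets_vios:
        acc_v = frame_accuracy(nets_vios[dim], frames, targets_by_dim[dim])
        acc_n = frame_accuracy(nets_ntimit[dim], frames, targets_by_dim[dim])
        print(f"{dim:10s}  VIOS-trained {acc_v:.1%}  NTIMIT-trained {acc_n:.1%}")
```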
The ELITIST Approach Applied to Dutch
For VIOS-trained Classifiers
• Frames with Confidence Levels Below “Threshold” are Discarded
– Setting the threshold to 0.7 filters out ca. 15% of the frames, corresponding to 6% of the segments
• The Accuracy of MANNER Frame Classification Improves
– Frame-level classification accuracy increases from 85% to 91%
For NTIMIT-trained Classifiers (classifying VIOS material)
• Frames with Confidence Levels Below “Threshold” are Discarded
– Setting the threshold to 0.7 filters out ca. 19% of the frames
• The Accuracy of MANNER Frame Classification Improves
– Frame-level classification accuracy increases from 73% to 81%
Place of Articulation is Manner-Dependent
• Although There are Nine Distinct Place-of-Articulation Features Overall
• For Any Single Manner Class There are Only Three Place Features
• The Locus of Articulatory Constriction Differs Among Manner Classes
Place of Articulation is Manner-Dependent
• Thus, if the Manner is Classified Correctly, this Information can be Exploited to Enhance Place-of-Articulation Classification
• Each Manner Class can then be Trained on Comparable Relational Place Features: ANTERIOR – CENTRAL – POSTERIOR
• Knowing the “Manner” Improves “Place” Classification for both Consonants and Vowels in DUTCH
VIOS (telephone) Corpus
Manner-Specific Place Classification – Dutch
• Knowing the “Manner” Improves “Place” Classification for the “Approximant” Segments in DUTCH
• Approximants are Classified as “Vocalic” Rather Than as “Consonantal”
VIOS (telephone) Corpus
Summary – ELITIST Goes Dutch
• Cross-Linguistic Transfer of Articulatory Features
– Classifiers are more than 80% correct on all AF dimensions except “place” when trained and tested on VIOS
– Voicing and manner classification is comparable between VIOS and NTIMIT
– Place classification (for VIOS) is much worse when trained on NTIMIT
– Other AF dimensions are only slightly worse when trained on NTIMIT
• Application of the ELITIST Approach to the VIOS Corpus
– Results improve when the ELITIST approach is used
– Training on VIOS: frame-level classification accuracy increases from 85% to 91% (15% of the frames discarded)
– Training on NTIMIT: frame-level classification accuracy increases from 73% to 81% (19% of the frames discarded)
• Manner-Specific Classification for Place-of-Articulation Features
– Knowing the “manner” improves “place” classification for vowels and for consonants
– Accuracy increases between 10% and 20% (absolute) for all “place” features
– Approximants are classified as “vocalic,” not “consonantal” – knowing the “manner” improves “place” classification for “approximant” segments
Part Five
Automatic Stress Accent Labeling
In collaboration with Shawn Chang
Annotation of Stress Accent
Forty-five minutes of the Switchboard corpus was manually labeled with respect to stress accent using perceptual criteria
Three levels of accent were distinguished: Heavy, Light, None
(In actuality, labelers assigned a “1” to fully accented syllables, a “null” to completely unaccented syllables, and a “0.5” to all others)
An example of the annotation (attached to the vocalic nucleus) is shown below (where the accent levels could not be derived from a dictionary)
In this example most of the syllables are unaccented, with two labeled as lightly accented (0.5) and one other labeled as very lightly accented (0.25)
Annotation of Stress Accent
The data are available at:
http://www.icsi.berkeley.edu/~steveng/prosody
Automatic Labeling of Stress Accent
This forty-five minutes of hand-labeled prosodic (and phonetic) annotation
from the Switchboard corpus was used as training data for development of
an Automatic Stress Accent Labeling System (AutoSAL)
How Good is AutoSAL?
There is a 79% concordance between human and machine accent labels when the tolerance level is a quarter-step
There is a 97.5% concordance when the tolerance level is half a step
This degree of concordance is as high as that exhibited by two highly trained (human) transcribers
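Concordance at a given tolerance can be computed as sketched below, assuming labels on the 0 / 0.25 / 0.5 / 0.75 / 1 accent scale implied above; the function name is an assumption.

```python
# A machine label counts as agreeing with the human label if they
# differ by no more than the tolerance on the accent scale.
# Quarter-step tolerance = 0.25, half-step = 0.5.
def concordance(human, machine, tolerance=0.25):
    agree = sum(abs(h - m) <= tolerance for h, m in zip(human, machine))
    return agree / len(human)

# e.g. concordance(h, m, 0.25) -> ~0.79 ; concordance(h, m, 0.5) -> ~0.975
```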
Acoustic/Phonetic/Linguistic Basis of Accent
What are the most important features for simulating stress-accent labeling
using AutoSAL?
Duration, (normalized) energy, vocalic identity (and its acoustic correlates)
Pitch-related features are (relatively) unimportant for stress-accent labeling
The Full Monty….
AutoSAL – The Full Monty (in text)
45 feature sets were used in a near-exhaustive search for the most relevant
parameters associated with stress accent
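The sketch below illustrates the kind of per-syllable feature vector such a
search might evaluate, built from the dominant cues named above (duration,
normalized energy, vocalic identity); the field names and vowel inventory are
assumptions, not AutoSAL's actual parameters

```python
# Hypothetical per-syllable feature vector for stress-accent
# classification. The source identifies duration, normalized energy,
# and vocalic identity as the dominant cues, with pitch contributing
# relatively little; everything else here is illustrative.

VOWEL_INVENTORY = ["iy", "ih", "eh", "ae", "aa", "ah", "ao",
                   "uw", "uh", "er", "ay", "ey", "ow", "aw", "oy"]

def syllable_features(duration_ms, energy, utterance_mean_energy, vowel):
    """Build a feature dict for one syllable's vocalic nucleus."""
    return {
        "duration_ms": duration_ms,                     # nucleus duration
        "norm_energy": energy / utterance_mean_energy,  # per-utterance energy normalization
        # one-hot encoding of vocalic identity
        **{f"vowel={v}": int(v == vowel) for v in VOWEL_INVENTORY},
    }

print(syllable_features(142.0, 0.63, 0.48, "aa"))
```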
Part Six
INTO THE FUTURE
Towards Fully Automatic Transcription Systems
An Empirically Oriented Discipline Based on Annotated Corpora
The Eternal Pentangle
• Phonetic and Prosodic Annotation is Limited in Quantity
– This material is important for understanding spoken language and developing
superior technology for recognition and synthesis
I Have a Dream, That One Day ….
• There will be Annotated Corpora for All Major Languages of the World
(generated by automatic means, but based on manual annotation)
• That Each of These Corpora will Contain Detailed Information About:
– Articulatory-acoustic features
– Phonetic segments
– Pronunciation variation
– Syllable units
– Lexical representations
– Prosodic information pertaining to accent and intonation
– Morphological patterns, as well as syntactic and grammatical material
– Semantics and its relation to the lower tiers of spoken language
– Audio and video detail pertaining to all aspects of spoken language
• That a Science of Spoken Language will be Empirically Based
– Using these annotated corpora to perform detailed statistical analyses
– Generating hypotheses about the organization and function of spoken language
– Performing experiments based on insights garnered from such corpora
• That Such Corpora will be Used to Develop Wonderful Technology
– To create “flawless” speech recognition
– And “perfect” speech synthesis
That’s All, Folks
Many Thanks for Your Time and Attention
Phonetic Transcription
How was the Labeling and Segmentation Performed?
VERY carefully …. by UC-Berkeley linguistics students
Using a display of the signal waveform, spectrogram, word transcription, and
“forced alignments” (automatic estimates of phones and boundaries) + audio
(listening at multiple time scales: phone, word, utterance) on Sun workstations
Additionally, automatic segmentation and labeling of articulatory manner were
used as a guide for phonetic labeling and segmentation in the current year
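As a rough sketch of the “forced alignment” starting point that transcribers
corrected, the code below parses one (phone, start, end) record per segment;
the file format and field order are assumptions, not the actual tool's output

```python
# Hypothetical parser for forced-alignment records: one phone label
# plus start/end times (in seconds) per segment. Such alignments served
# as automatic estimates that human transcribers then corrected.

from typing import NamedTuple

class Segment(NamedTuple):
    phone: str
    start: float  # seconds
    end: float    # seconds

def parse_alignment(lines):
    """Parse 'phone start end' lines into Segment records."""
    segments = []
    for line in lines:
        phone, start, end = line.split()
        segments.append(Segment(phone, float(start), float(end)))
    return segments

example = ["hh 0.00 0.06", "ae 0.06 0.21", "v 0.21 0.27"]
for seg in parse_alignment(example):
    print(f"{seg.phone}\t{seg.start:.2f}\t{seg.end:.2f}")
```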