Three examples
of sound-system research
using web-available materials
Andy Wedel
LSA Summer Institute: The Data Goldmine
July 9, 2015
1. Functional load and diachronic phoneme
inventory change
– Published literature on sound-change in
combination with phonemically-coded corpora
2. Lexical competition and hyperarticulation in
natural speech
– Phonetic measures in the Buckeye Corpus in
combination with lexical data on English
3. Correlation between crosslinguistic and
language-internal phoneme frequencies
– A database of phoneme inventories combined
with available phonemically-coded corpora
Organizational steps in research
• What is the question?
– Identify your general hypothesis
• What is the approach?
– Operationalize your hypothesis
– Develop a method/experiment
• Find data/create materials
• Analysis/Results
• Dissemination
1. Functional load and diachronic
phoneme inventory change
With: Abby Kaplan
Department of Linguistics
University of Utah
Scott Jackson
Center for the Advanced Study
of Language (CASL)
University of Maryland
Functional load and diachronic
phoneme inventory change
1. What is the question?
2. What is the approach?
3. Find data/create materials
4. Analysis/Results
Phoneme inventories change over time
Does the functional load of a
phoneme contrast influence
its trajectory of change?
– Gilliéron (1918), Jakobson
(1931), Mathesius (1931),
Trubetzkoy (1939)
– Martinet (1952), King
(1967), Hockett (1967)
– Surendran & Niyogi (2006),
Silverman (2011), Kaplan
(2011)
Functional load
“The notion of functional load is
that a phonemic system … has
a (quantifiable) job to do, and
that the contrast between any
two phonemes, say /a/ and /b/,
carries its share.”
Charles Hockett 1967
Functional load
Specific Hypothesis:
Neutralization is less likely for contrasts
that have a higher functional load.
(Martinet 1955, Hockett 1967)
Phoneme Mergers
/ɑ ~ ɔ/ merger in western American English
cot /ɑ/ ~ caught /ɔ/
How has functional load been
operationalized?
• In terms of the lexicon:
– Number of minimal pairs (Martinet 1955)
• Various ways of counting number of homophones
(Silverman 2009, Kaplan in press)
– Lexical level entropy (Surendran and Niyogi 2006)
• In terms of the sound system
– Type or token phoneme frequency (Currie-Hall
2010)
– Phoneme level entropy (Hockett 1967, King 1967,
Surendran and Niyogi 2006)
Why hadn’t this been
successfully tested before?
• Previous approaches involve case-studies:
1. Find a contrast merger or set of mergers
2. Assess the change in the system given your favorite
measure of functional load
3. Compare to a set of similar contrasts that did not
merge.
4. Is the change in the system smaller for the actual
mergers than for the non-mergers?
• Problem: if we assume that functional load is just
one of many factors influencing sound change, we
expect many ‘exceptions’ to the hypothesis.
→ We need to assess outcomes statistically.
Strategy for dealing with data
sparseness, diversity of data source
1. Pool data on mergers from multiple languages.
2. Use linear mixed effects modeling.
– Random effects structure helps control for structure
inherent in different data-sources.
What’s the balance between
hypothesis generation and testing?
• Broad general hypothesis to be tested:
– Functional load predicts merger
• Narrower hypotheses to be explored:
– what specific measure(s) of functional load
are predictive?
Building a database
• Hockett 1955 “... unfortunately the
determination [of functional load] has not
been made yet [because] the amount of
counting and computation is formidable,
so we can give no example ...”
Use existing frequency corpora to build a
large database of reasonably recent
mergers and associated comparison sets.
Find word lists from a variety of languages
• We don’t know what measure of functional
load is appropriate: want to be able to test a
variety of measures
– Minimal pair count
– Average neighborhood density
– System entropy
• Requirements for each word list:
– Phonemically coded
– Lemmatized
– Frequency
Find word lists from a variety of languages
• German
• Dutch, RP English
• American English
• Spanish
• French
• Turkish
• Korean
• HK Cantonese
Material won’t perfectly
match your question
• Key!
– Always keep your eyes open for new data sources.
– Be ready to do some work to transform information
into a form appropriate for your question.
– You’ll often have to make semi-arbitrary decisions
• Keep notes, and be ready to describe/defend your
choices.
• Examples differing in ease:
– Turkish > American English > Spanish
Turkish: easy to work with
• Obtained by emailing authors
• Easy to work with:
– Orthographic coding already near-phonemic
• coding is pre-merger
– Morphologically parsed into stem + affixes
– Syntactic category given
– ArisoyTurkishData
– LemmaForms
American English: moderately
easy to work with
• Get standard US pronunciation from
Carnegie-Mellon Pronouncing Dictionary
(CMUDict)
• Frequency databases freely available
– CELEX, SubtlexUS
– How to deal with homographs?
• Example output files with ND calculated
– LemmaForms
Spanish: More complex
• Spanish Gigaword corpus (Linguistic Data
Consortium)
– Text files from newswires
– Example
• Use TreeTagger to morphologically parse
and add categories
• Example of output
• Map to phonemic representation and count
• Show code and output
Looking for changes of interest
• Look through the literature for
diachronically recent phoneme mergers in
varieties of these languages that share the
same phonemic inventory as the dialect on
which the word list is based.
– For example:
• American and RP English have distinct vowel
inventories;
• RP and Australian English share phoneme
inventories, even though they are phonetically
different.
Looking for changes of interest
• Identify a set of comparison phonemes of
the same major class (consonant, vowel)
as the merged phoneme pair that are
phonologically similar.
– 1 basic feature distant, e.g., t ~ d, t ~ k, u ~ o
56 mergers
524 non-mergers
8 languages
18 phoneme-pair systems:
Each contains at least one
merger, and as
comparisons, all other
phoneme pairs in the
same major class (vowel
or consonant) that are one
phonological feature apart.
Wedel, A., S. Jackson & A. Kaplan
(2013). Cognition.
Wedel, A., A. Kaplan & S. Jackson
(2013). Language and Speech.
Independent measures
• Lexical measures:
– Number of minimal pairs distinguished by
each phoneme pair
• Write a script that goes through each phonemic
form, merges the contrast using a regular
expression, and counts how many other phonemic
forms it becomes identical to.
– Lemma vs word-form counts
– Within/across word category
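The counting script described above can be sketched roughly as follows. This is a minimal illustration, not the script used in the study: it assumes a hypothetical coding with one character per phoneme, and the name `count_minimal_pairs` and the toy word list are made up for the example. Like the description above, it counts forms that become identical after the merger, so it also picks up pairs that differ in more than one position.

```python
import re
from collections import defaultdict

def count_minimal_pairs(forms, a, b):
    """Count pairs of distinct forms that become identical when a and b merge."""
    # Assumption: each phoneme is coded as exactly one character.
    merge = re.compile("[" + re.escape(a) + re.escape(b) + "]")
    groups = defaultdict(set)
    for form in set(forms):
        if a in form or b in form:
            # Rewrite every a or b as a, then group forms by the merged shape.
            groups[merge.sub(a, form)].add(form)
    # n forms collapsing onto one shape yield n * (n - 1) / 2 pairs.
    return sum(len(g) * (len(g) - 1) // 2 for g in groups.values())

# Toy phonemic word list (hypothetical data).
toy_forms = ["tip", "dip", "tap", "dot", "kat"]
print(count_minimal_pairs(toy_forms, "t", "d"))
```

Running the same function over lemma forms versus inflected word forms, or over forms restricted to one syntactic category, gives the lemma/word-form and within/across-category variants listed above.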
Independent measures
• Lexical measures:
– Number of lexical ‘prefixes’ distinguished by
each phoneme pair (Cohen-Priva, in press)
– Average neighborhood density for words
containing each phoneme
– Lexical entropy change on merger (Surendran
& Niyogi 2006)
Calculating functional load in terms
of informational entropy (Shannon 1951)
General form (Hockett 1967, Surendran and Niyogi 2006):
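$$\mathrm{FL}(a \leftrightarrow b) = \frac{H(L) - H(L_{a \leftrightarrow b})}{H(L)}$$
where $H(L)$ is the entropy of the language $L$ and $H(L_{a \leftrightarrow b})$ is its entropy after the $a \sim b$ contrast is merged.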
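As a rough illustration of the entropy-based measure, the sketch below computes the relative entropy loss at the word level. It is an assumption-laden toy, not the procedure used in the cited work: entropy is taken over a simple unigram distribution of word-form frequencies, phonemes are coded one character each, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(freqs):
    """Shannon entropy in bits of a frequency table {form: count}."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total)
                for f in freqs.values() if f > 0)

def functional_load(freqs, a, b):
    """Relative entropy loss (H(L) - H(L_merged)) / H(L) when a and b merge."""
    merged = Counter()
    for form, f in freqs.items():
        # Forms that become homophones pool their frequencies.
        merged[form.replace(b, a)] += f
    h = entropy(freqs)  # assumes more than one form, so h > 0
    return (h - entropy(merged)) / h

# Toy frequency list (hypothetical data).
toy_freqs = {"tip": 1, "dip": 1, "kat": 2}
print(functional_load(toy_freqs, "t", "d"))
```

In this toy lexicon the merger collapses one pair of forms, so the entropy drops from 1.5 bits to 1 bit and the relative loss is one third.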
Independent measures
• Sublexical measures:
– Phoneme type/token frequencies
• uniphone, biphone, triphone
– Sublexical entropy change upon merger
– Dataset example
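Type and token frequencies for uniphones, biphones and triphones can be collected in one pass over a frequency list. The sketch below is illustrative only and assumes one character per phoneme; `nphone_counts` and the toy data are hypothetical.

```python
from collections import Counter

def nphone_counts(freqs, n):
    """Type and token counts of phoneme n-grams (n = 1, 2, 3)."""
    type_counts, token_counts = Counter(), Counter()
    for form, f in freqs.items():
        for i in range(len(form) - n + 1):
            gram = form[i:i + n]
            type_counts[gram] += 1   # one count per occurrence in a lexical entry
            token_counts[gram] += f  # weighted by the word's corpus frequency
    return type_counts, token_counts

# Toy frequency list (hypothetical data).
types, tokens = nphone_counts({"tip": 3, "dip": 1}, 2)
print(types["ip"], tokens["ip"])
```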
Number of minimal pairs is
inversely correlated with merger
Wedel, A., A. Kaplan & S. Jackson (2013). Language and Speech.
What kind of minimal pairs?
Lemma vs word form?
Within vs Between Category?
Frequency?
What does not seem to substitute for
minimal pairs in this effect?
• Lexical measures
– Neighborhood measures
– Lexical entropy change
• Sublexical measures
– sublexical entropy changes
– uniphone, biphone, triphone probabilities
Intriguing: Higher phoneme frequency is
positively correlated with merger
…but only for phoneme pairs that don’t
distinguish minimal pairs.
Example model predictions
American English
What about changes that might index
avoidance of merger?
• Phoneme Shift: concerted shift of a
phoneme pair in the same dimensional
space.
• Phoneme Split: merger of a contrast
associated with enhancement of an
associated contrast in a different
dimension.
Phoneme Shifts
– California Vowel Shift
[Figure: vowel chart showing /ɪ/, /u/ (dude), /ɛ/ (dress), /o/, /æ/ (fat), /ɑ/]
Phoneme Splits
– Vowel length split in Pittsburgh English
• town ~ ton
taʊn ~ tʌn → tʌ:n ~ tʌn
[Figure: vowel chart with ʊ, ʌ, a and the words ton, town]
What’s the balance between
hypothesis generation and testing?
• We already have a strong prediction that a
small number of within-category minimal
lemma pairs predicts merger.
• Narrower hypothesis to be explored:
– Shifts and splits…
• which are phoneme inventory changes that
preserve lexical distinctions…
– are correlated with a significantly larger
number of within-category minimal lemma pairs.
Get examples of shifts/splits in
our set of languages
• Shifts
– Spanish voiced/voiceless stop pairs
• Lewis 2000
– American English vowel shifts: Northern cities,
Southern Shift
• Labov et al. 2006
– NZ English front vowel shifts
• Hay, Maclagan, & Gordon 2008
– Polder Dutch diphthongs
• Jacobi 2009
– Canadian French vowel shift
• Walker 1983
Database of Shifts/Splits
• Splits
– Pittsburgh /ɑʊ ~ ʌ/, Inland North /e ~ ɑ/ → vowel length
• Labov et al. 2006
– Turkish ɣ deletion → vowel length
• Lewis 1967
– NZE /dress ~ fleece/ → diphthongization
• Maclagan and Hay, 2005
– Korean onsets /lax ~ aspirated/ → tone
• Silva 2006
Mergers versus Shifts and Splits
[Figure: phoneme splits/shifts compared with phoneme mergers]
Can we predict the direction of change?
• Given a phoneme-inventory change, was it
– a change that reduces lexical distinctions?
→ a merger
– a change that preserves lexical distinctions?
→ a shift or a split
Given a change, predicting its type
[Figure: Merger vs. Shift/Split as a function of log minimal lemma pair count]
Individual datasets
[Figure: one panel per dataset (Am, Du, Fr, Ge, HK, Ko, RP, Sp, Tr); x-axis: log-transformed count of within-category minimal pairs; y-axis: counts]
New insights
• The distribution of a phonological contrast
across the lexicon influences the trajectory of
change in that phonological contrast.
• Results in maintenance of a compact
phoneme inventory.
– Contrasts that support few lexical contrasts tend to
be lost.
– Contrasts that support more lexical contrasts are
preserved, or provide seed variation for new
contrasts.
Take home message with respect
to big data and computation…
• New data sources, models and technologies
allow us to better test hypotheses concerning
the relationship of the form of sound
systems to their function in communication.
2. Lexical competition and
hyperarticulation in natural speech
With: Becky Sharp
Department of Linguistics
University of Arizona
Lexical competition and
hyperarticulation in natural speech
1. What is the question?
2. What is the approach?
3. Find data/create materials
4. Analysis/Results
Big question
• If the existence of minimal pairs influences
change in a phoneme contrast, what are
the mechanisms, at various levels?
• Theoretical Prediction:
(e.g., Lindblom 1990, Wright 2004, Wedel 2012…)
→ Phonetic cues that support communication
are hyperarticulated in usage.
→ Consistent usage biases drive
phonological change and pattern formation.
[Diagram boxes: Biases on phonetic form of word tokens; Change in distribution of word variants; Change in distribution of sublexical variants across the lexicon]
[Diagram labels: Articulation; Perception; Cognitive biases; Bias toward accurate transmission of lexical information; Social factors; System-internal patterns; Acquisition biases; …]
Theoretical/Linguistic/Experimental evidence:
e.g., Baudouin de Courtenay 1895, Ohala 1989, Lindblom 1990, Bybee 2001, Blevins 2004, Baese-Berk & Goldrick 2009, Ernestus 2011, Wedel et al. in press
Wang 1969, Bybee 2002, Phillips 2006, Kraljic & Samuel 2009, Hay and Maclagan 2012
[Within-phoneme category variants]
Selection for word-level contrast
/Phoneme category evolution/
Background/Previous Work
• Previous work done using lab speech
• Small effects, fragile results
– VOT is slightly hyperarticulated for initial stops
given minimal pairs in list reading (Baese-Berk & Goldrick
2009, Peramunage et al. 2011).
– VOT hyperarticulated on first production of words
with a visual stop-competitor in the context (Kirov &
Wilson 2012)
– In a lab-speech paradigm designed to elicit
hyperarticulation away from a vowel competitor,
tense/lax vowel duration differences increased,
but not formant differences (Schertz 2015)
Work with vowels has focused on
ND as the trigger for hyperarticulation,
and dispersion as the outcome
• Dispersion = distance of a vowel in F1-F2
space from the center of the vowel space
• But: vowel change patterns suggest that
competition-driven hyperarticulation
should be more phonetically specific.
– Correlation of minimal pair count with vowel
shift patterns links competition to shifts
– Vowel chain shifts often involve moves toward
the center of the vowel space.
Dispersion as the outcome of competition
makes the wrong prediction for vowel system
change: Vowels can centralize in chain-shifts.
American Northern Cities Shift
Ok, so how to approach this?
1. Use natural speech instead of lab speech
2. Compare minimal pair existence to
neighborhood density as a predictor for
hyperarticulation
3. Look at both VOT for stops and F1-F2
Euclidean distance for vowels
– For vowels, compare F1-F2 distance to
dispersion
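The two vowel measures being compared can be written down compactly. The sketch below is only an illustration: it assumes vowels are given as (F1, F2) pairs on a common scale, and it takes the center of the vowel space to be the mean F1 and mean F2 of a reference set, which is one possible definition among several. The function names are hypothetical.

```python
import math

def f1f2_distance(v1, v2):
    """Euclidean distance between two vowels given as (F1, F2) pairs."""
    return math.hypot(v1[0] - v2[0], v1[1] - v2[1])

def dispersion(vowel, reference_vowels):
    """Distance of a vowel from the center of the vowel space.

    Assumption: the center is the mean (F1, F2) of the reference vowels.
    """
    n = len(reference_vowels)
    center = (sum(v[0] for v in reference_vowels) / n,
              sum(v[1] for v in reference_vowels) / n)
    return f1f2_distance(vowel, center)

# Made-up formant values in Hz, for illustration only.
print(f1f2_distance((300, 2300), (600, 1900)))
```

The contrast matters because a token can move away from a specific competitor vowel (larger vowel-vowel distance) while moving toward the center of the space (smaller dispersion).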
Use the Buckeye Corpus of
Conversational Speech
• 40 one-hour sociolinguistic interviews
• gender and age balanced
• obtained in Columbus, Ohio in 2000
• Densely annotated:
– Phonemic transcription
– Phonetic transcription
– Syntactic category
– Textfiles, phonefiles
VOT in word-initial stops
• Use a Perl script to identify appropriate
material in the Buckeye Corpus
– words starting with [ptkbdg]
– content words
– 1, 2 syllables
– no preceding/following utterance or disfluency
boundary
– no preceding word-final stop
• Measure closure and burst lengths
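The selection criteria above amount to a filter over word tokens. The sketch below shows the logic in Python rather than Perl; the `Token` record and its fields are hypothetical and do not reflect the actual Buckeye file format, which would first have to be parsed into something like this.

```python
from dataclasses import dataclass

STOPS = set("ptkbdg")

@dataclass
class Token:
    phones: str            # hypothetical one-character-per-phone coding
    is_content: bool       # content word rather than function word
    n_syllables: int
    at_boundary: bool      # adjacent to an utterance or disfluency boundary
    prev_final_phone: str  # last phone of the preceding word

def is_vot_candidate(tok):
    """Apply the listed selection criteria to one word token."""
    return (tok.phones[:1] in STOPS
            and tok.is_content
            and tok.n_syllables in (1, 2)
            and not tok.at_boundary
            and tok.prev_final_phone not in STOPS)

print(is_vot_candidate(Token("pat", True, 1, False, "n")))
```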
Stop length, burst and offset
[Figure: annotated stop length and burst length for [p] in "a pea" and [b] in "a bee"]
VOT data creation
• Annotate stop beginning, burst and offset
using Praat.
– Get lots of undergraduate helpers for this…
Praat example
Get dependent measure
• Script that processes Praat textgrid textfile
to obtain:
– Stop length, burst length
– Use burst/length ratio as a rate-normalized
measure of VOT (Yao 2007)
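Given the three annotated time points, the dependent measure reduces to a ratio. The sketch below assumes that stop length runs from stop beginning to offset and that burst length runs from burst to offset; that reading of the annotation scheme is an assumption, and the function name is hypothetical.

```python
def burst_length_ratio(stop_start, burst_start, stop_end):
    """Burst duration divided by total stop duration (closure plus burst)."""
    if not (stop_start <= burst_start <= stop_end) or stop_end == stop_start:
        raise ValueError("times must satisfy start <= burst <= end, start < end")
    stop_length = stop_end - stop_start
    burst_length = stop_end - burst_start
    return burst_length / stop_length

# Made-up annotation times in seconds.
print(burst_length_ratio(0.0, 0.06, 0.10))
```

Dividing by the whole stop duration is what makes the measure less sensitive to local speech rate than raw burst duration.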
Get independent factors
of interest
• Minimal pair existence
– Carnegie Mellon Pronouncing Dictionary
• Neighborhood density
– Calculate independently
– IPhOD
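Calculating neighborhood density independently can look like the sketch below, which counts lexicon entries that are one phoneme substitution, deletion or insertion away from a word. It assumes one character per phoneme; the function name, the toy lexicon and the phoneme set are hypothetical.

```python
def neighborhood_density(word, lexicon, phonemes):
    """Number of lexicon entries one phoneme edit away from word."""
    candidates = set()
    for i in range(len(word) + 1):
        for p in phonemes:
            candidates.add(word[:i] + p + word[i:])          # insertion
        if i < len(word):
            candidates.add(word[:i] + word[i + 1:])          # deletion
            for p in phonemes:
                candidates.add(word[:i] + p + word[i + 1:])  # substitution
    candidates.discard(word)  # a word is not its own neighbor
    return len(candidates & set(lexicon))

# Toy lexicon (hypothetical data).
toy_lexicon = {"kat", "bat", "at", "kats", "kit", "dog"}
print(neighborhood_density("kat", toy_lexicon, "abdgikost"))
```

Minimal pair existence for a specific contrast is the narrower question of whether one particular substitution neighbor (for example, the voicing counterpart of the initial stop) is in the lexicon.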
Burst/length ratio by minimal pair existence:
Initial stops distinct in voice
[Figure: burst/length ratio for voiced and voiceless stops, with and without minimal pairs; example words: pant, pat, badge, bat]
Relationship of Neighborhood Density
to Burst/length Ratio
[Figure: burst/length ratio plotted against number of lexical neighbors]
Factors in
Linear Mixed Effects modeling
• Stop-voicing minimal pair competitor existence
• Neighborhood density
• Control factors:
– local speech rate
– word category
– forward/backward bigram probabilities
– word frequency
– previous mention
– syllable number
– stop identity
– following high (liquids, rhotics, high vowels)
Get control factors
• From Buckeye word files:
– Word identity
• previous, target, and following word
– Word category
– Previous Mention
– Speech rate
• syllables per second in local utterance
Get control factors
• From corpora, get forward/backward bigram
probability
– Google n-gram
– Fisher English Training set
• see Seyfarth 2014 for example
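Forward and backward bigram probabilities can be estimated from any tokenized corpus. The sketch below uses unsmoothed maximum-likelihood estimates with unigram counts as denominators, which is a simplification; it is not the procedure of the cited work, and the function name and toy sentences are hypothetical.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Return functions for P(word | previous word) and P(word | next word)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    def forward(prev, word):
        # P(word | prev), unsmoothed; 0.0 if prev was never seen.
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    def backward(word, nxt):
        # P(word | nxt), unsmoothed; 0.0 if nxt was never seen.
        return bigrams[(word, nxt)] / unigrams[nxt] if unigrams[nxt] else 0.0

    return forward, backward

# Toy corpus (hypothetical data).
forward, backward = bigram_probabilities([["the", "cat", "sat"],
                                          ["the", "dog", "sat"]])
print(forward("the", "cat"), backward("cat", "sat"))
```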
Get control factors
• From IPhOD:
– SubtlexUS-based word frequency
– Neighborhood density
– Positional two-segment probability averaged
over the word (Vitevitch & Luce 2004)
Voiceless Stop Model
Voiced Stop Model
Can we find this effect in vowels?
Measuring vowel-vowel distances
Vowel distances in initial
syllables
• Identify material in the Buckeye Corpus of
Conversational Speech
– words with an initial syllable non-back
monophthong
– content words
– 1 syllable
– no preceding/following utterance or disfluency
boundary
– no words with ablaut in their paradigm
• e.g., no ‘sit’, because of ‘sat’.
Dataset construction
For each word token,
measure vowel distance to
three neighboring vowels.
Starting dataset has three
measures per word token:
Split randomly into three
datasets with one measure
per token. Randomly
choose one dataset for
statistical analysis.
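One way to read the splitting step is sketched below: each token's three measures are shuffled and dealt out to three datasets, so every dataset holds exactly one measure per token, and one dataset is then picked at random. This is an interpretation of the description above rather than the authors' code; the function name, the data layout and the fixed seed are assumptions.

```python
import random

def split_measures(tokens, seed=0):
    """Deal each token's three measures randomly into three datasets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    datasets = ([], [], [])
    for token_id, measures in tokens.items():
        if len(measures) != 3:
            raise ValueError("expected exactly three measures per token")
        shuffled = list(measures)
        rng.shuffle(shuffled)
        for dataset, measure in zip(datasets, shuffled):
            dataset.append((token_id, measure))
    # Return all three datasets plus the one chosen for analysis.
    return datasets, rng.choice(datasets)

# Toy data: token id -> (competitor vowel, distance); values are made up.
toy_tokens = {"t1": [("i", 1.0), ("e", 2.0), ("ae", 3.0)],
              "t2": [("i", 1.5), ("e", 0.5), ("ae", 2.5)]}
datasets, chosen = split_measures(toy_tokens)
print(len(chosen))
```

Keeping one measure per token in the analyzed dataset avoids entering the same token three times into the model.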
Minimal Pair existence
[Figure: vowel-vowel distances measured from [i] and from [ɛ] within the vowel set i, ɪ, e, ɛ, æ, ʌ, comparing tokens whose minimal pair exists with tokens whose minimal pair does not exist; larger distance = more distinctive]
Factors
• Vowel-vowel minimal pair competitor existence
• Neighborhood density
• Vowel-vowel minimal pair competitor existence
in one of the other two neighboring vowels
• Control factors:
– local speech rate
– forward/backward bigram probabilities
– word frequency
– previous mention
– vowel length
– vowel-vowel pair identity
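LME model
library(lme4)  # provides lmer()
# Fixed effects: predictors of interest plus controls.
# Random effects: by-speaker intercept and slopes, by-lemma intercept.
model = lmer(EuclideanDistance ~
    MinimalPair +
    Neighborhood +
    Alternative +
    VowelLength +
    Vowel_CompetitorVowel +
    (1 + MinimalPair + Neighborhood + Alternative + VowelLength | Speaker) +
    (1 | Lemma),
    data = k, REML = FALSE)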
Model output
What about dispersion?
• Run the same kind of analysis using
vowel-center distance.
• Factors that significantly predict
dispersion:
– Word Frequency
– Vowel Length
• Neighborhood density and minimal pair
competitor existence are not predictive.
Summary
• Phonetic cues that contribute strongly to
distinguishing words tend to be hyperarticulated
in natural speech
– VOT in initial stops
– F1-F2 distance in vowels
• Consistent with the idea that phoneme contrast is
maintained in part by a bias toward lexical
contrast.
→ Maintains an efficient set of phoneme contrasts
over language change: phonemes that do not
distinguish many words are vulnerable to loss.