Wedel Final Report

LINEAR MIXED MODEL FOR VOWEL DISTANCES
Ziyue Jia, Nicholas Lytal, Min Zhu
12th October 2015
1 EXECUTIVE SUMMARY
Andrew Wedel, PhD (Linguistics), consulted StatLab to evaluate his linear mixed effects
model for vowel distance. He wished to confirm whether his model was satisfactory and
to clarify some of the steps involved.
2 DATA COLLECTION
The data Dr. Wedel provided came from a freely available data set that was not specifically
designed for this study. The sample consists of 40 working-class individuals from Ohio
who were each asked to talk about local affairs for one hour in order to record natural
conversation. Directly asking individuals to compare vowel sounds tends to lead
subjects to speak unnaturally, which is undesirable and could produce bias.
3 BACKGROUND
In the English language, many spoken words have minimal pairs (words that differ
only by their vowel sound), such as “cot” and “caught.” Such words may be
pronounced identically in some dialects and differently in others. In addition, some
dialects are more careful about distinguishing minimal pairs in pronunciation.
Dr. Wedel’s hypothesis is that when speakers pronounce words that have a minimal pair,
they differentiate the vowel sounds further than they would for words without a minimal
pair. For instance, “jet” has no minimal pair with the “ɪ” vowel (there is no word “jit”),
but “bet” does, with “bit.” Therefore, the “ε” and “ɪ” vowels in “bet” and “bit” should be
pronounced more differently (further apart) than “ε” and “ɪ” in unrelated words such
as “set” and “mitt.”
All analysis was completed with R.
4 METHODS USED
Before our meeting, Dr. Wedel fitted a linear mixed effects model to his data. The model
combines several of the most important factors in vowel separation with other
covariates that may affect speech according to the literature and common practice in
linguistics. Since the existence of minimal pairs may vary by speaker, the
transcriber determined whether each subject made these distinctions. Dr. Wedel decided
on a first model that includes the following important factors as random effects:
• Lemma: The words used by the speakers in their conversation. Different words in
  the same vowel comparison may have different effects. (Categorical)
• IMinPairExist2: Whether or not a minimal pair exists between the token vowel
  and the specific vowel being compared. (Binary)
• SumAlt: Whether or not a minimal pair exists for the token vowel OTHER than
  the one being compared. (Binary)
• IND: The neighborhood index, or the number of words very similar to the word in
  question (for instance, pet, pend, and peck are all part of the neighborhood index
  for pen). (Discrete)
Also included in the model are the following variables:
• sILemFreq: A measure of how commonly the word (lemma) was spoken. (Continuous)
• SIBiphone: A word-level variable related to the biphones of that word. (Continuous)
• Gender: Gender of the speaker. (Binary)
• Vow_CompVow: Vowels being compared. (Categorical)
• sVowelLength: A perceived duration of a vowel sound. (Continuous)
• NewSpeechRate: The speech rate for each speaker. (Continuous)
• sForwardBigram: (Continuous)
• sBackwardBigram: (Continuous)
• Age: Whether the speaker is older than 40 or younger than 40. (Y=Above 40,
  O=Below 40) (Binary)
• Prevmen: (Binary)
• Cat: Category of lemma (J: Adjective, N: Noun, R: Adverb, V: Verb). (Categorical)
Following the convention in linguistics studies, all discrete and continuous variables are
centered (the variable’s mean is subtracted from each value) and shown as ‘c.(Variable)’ in R.
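Centering can be illustrated with a minimal sketch. It is written in Python purely for illustration (the report's analysis was done in R), and the neighborhood index values are made up rather than taken from Dr. Wedel's data.

```python
from statistics import fmean

def center(values):
    """Return the values with their mean subtracted (mean-centered)."""
    mean = fmean(values)
    return [v - mean for v in values]

# Hypothetical neighborhood index (IND) values for four lemmas.
ind = [1, 2, 3, 6]
c_ind = center(ind)
print(c_ind)  # the centered values sum to zero
```

After centering, a value of 0 corresponds to the average of the variable, so the model intercept refers to a token with average values of the centered covariates.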
After obtaining the R output for the first model, Dr. Wedel removed the fixed effect
variables that were not significant (p>0.05): sVowelLength, NewSpeechRate, sForwardBigram,
sBackwardBigram, Age, Prevmen, and Cat. He then refitted the model with
the significant fixed effect variables and all random effect variables.
Vowels were mapped by sound frequency (Hz) to a two-dimensional formant plot (see
example next page). This produces regions of vowel density whose separation can be
measured as a Euclidean distance. Each vowel, such as “ε” or “ɪ”, was
measured against its three nearest neighbor vowels according to the literature. Since some
words, vowels, and vowel pairs are more common than others, each vowel token was compared
with one of its three nearest neighbors in such a way that one third of all measurements
went to each of the three closest neighbors.
(http://www.ling.upenn.edu/courses/cogs501/Hillenbrand.html)
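The distance computation described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the data set: the response variable name (BarkEuclidDist) suggests distances on the Bark scale, but the report does not say which Hz-to-Bark conversion was used, so Traunmüller's (1990) approximation is assumed here. The formant values are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark (Traunmüller's 1990 approximation).

    Assumption: the report does not state which conversion was applied.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_euclid_dist(vowel_a, vowel_b):
    """Euclidean distance between two (F1, F2) points, measured in Bark."""
    a = [hz_to_bark(f) for f in vowel_a]
    b = [hz_to_bark(f) for f in vowel_b]
    return math.dist(a, b)

# Hypothetical (F1, F2) values in Hz, for illustration only.
vowel_1 = (580.0, 1800.0)
vowel_2 = (430.0, 2000.0)
print(round(bark_euclid_dist(vowel_1, vowel_2), 3))
```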
5 RESULTS AND DISCUSSION
Dr. Wedel had a few specific requests:
1. Analyze the model and determine whether its structure is sound and its results are meaningful.
R output for the first model:
Model:
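\[
\begin{aligned}
\text{BarkEuclidDist}_{ijk} ={}& \beta_{00} + \beta_{10}\,\text{MinPairExist2}_{ijk} + \beta_{20}\,\text{SumAlt}_{ijk} + \beta_{30}\,\text{c.(IND)}_{ijk} \\
&+ \gamma_{1i}\,\text{MinPairExist2}_{ijk} + \gamma_{2i}\,\text{SumAlt}_{ijk} + \gamma_{3i}\,\text{c.(IND)}_{ijk} \\
&+ \text{fix\_effect} + \gamma_{0i} + \tau_{0j} + \varepsilon_{ijk}
\end{aligned}
\]
\[
\gamma_{0i} \sim N(0, \sigma^2_{\gamma_0}),\quad
\gamma_{1i} \sim N(0, \sigma^2_{\gamma_1}),\quad
\gamma_{2i} \sim N(0, \sigma^2_{\gamma_2}),\quad
\gamma_{3i} \sim N(0, \sigma^2_{\gamma_3}),\quad
\tau_{0j} \sim N(0, \sigma^2_{\tau_0}),\quad
\varepsilon_{ijk} \sim N(0, \sigma^2)
\]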
β00 is the intercept: β̂00 = 2.360.
β10 is the fixed slope of MinPairExist2 = TRUE: β̂10 = 0.185.
β20 is the fixed slope of SumAlt = TRUE: β̂20 = -0.207.
β30 is the fixed slope of c.(IND): β̂30 = 0.000115.
i = 1, …, number of speakers:
γ0i is the random intercept of speaker i: γ̂0i ~ N(0, 0.0473).
j = 1, …, number of lemmas:
τ0j is the random intercept of lemma j: τ̂0j ~ N(0, 0.0921).
γ1i is the random slope of MinPairExist2 = TRUE for speaker i: γ̂1i ~ N(0, 0.0175).
γ2i is the random slope of SumAlt = TRUE for speaker i: γ̂2i ~ N(0, 0.0248).
γ3i is the random slope of c.(IND) for speaker i: γ̂3i ~ N(0, 0.0000358).
εijk is the residual error: εijk ~ N(0, 0.455).
The second argument of each normal distribution is the estimated variance.
Other fixed effect variables will not be elaborated in this report.
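To show how the reported fixed effect estimates combine, the sketch below computes a population-level predicted distance. It rests on two assumptions that are not part of the fitted model output: every fixed effect not reported here contributes zero (reference levels, centered covariates at their means), and all random effects are set to their mean of zero. The function name is illustrative.

```python
# Fixed effect estimates reported for the first model.
B00 = 2.360      # intercept
B10 = 0.185      # MinPairExist2 = TRUE
B20 = -0.207     # SumAlt = TRUE
B30 = 0.000115   # c.(IND)

def predicted_distance(min_pair, sum_alt, c_ind=0.0):
    """Population-level prediction from the four reported fixed effects.

    Assumption: other fixed effects contribute zero and all random
    effects are at their mean of zero.
    """
    return B00 + B10 * min_pair + B20 * sum_alt + B30 * c_ind

with_pair = predicted_distance(True, False)
without_pair = predicted_distance(False, False)
print(round(with_pair - without_pair, 3))  # 0.185
```

Under these assumptions, the existence of a minimal pair raises the predicted distance by 0.185, which is simply β̂10.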
After removing insignificant (p>0.05) fixed effect variables:
R output:
After fitting the model, the assumptions of the linear mixed model should be checked.
Normality: All six random terms (γ0i, γ1i, γ2i, γ3i, τ0j, εijk) should be normally
distributed with mean 0. Normality can be diagnosed by plotting the standardized residuals
against their normal scores. If the residuals are normally distributed, the points fall
along a straight diagonal line. (Hox, 2010)
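The normal-score plot described above can be sketched numerically. The following is an illustration only: the plotting position (i + 0.5)/n is an assumption (several conventions exist), the function names are made up, and the residuals are hypothetical rather than taken from the fitted model.

```python
from statistics import NormalDist, fmean, pstdev

def normal_scores(n):
    """Expected standard normal quantiles for n ordered observations."""
    return [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

def qq_points(residuals):
    """Pair each normal score with the matching sorted standardized residual."""
    mean, sd = fmean(residuals), pstdev(residuals)
    standardized = sorted((r - mean) / sd for r in residuals)
    return list(zip(normal_scores(len(residuals)), standardized))

def qq_correlation(residuals):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    xs, ys = zip(*qq_points(residuals))
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical residuals, for illustration only.
residuals = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 0.9]
print(round(qq_correlation(residuals), 3))
```

Plotting the pairs returned by `qq_points` gives the diagnostic plot itself; the correlation is only a rough one-number summary of how straight the line is.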
Homoscedasticity (constant variance): The variance of these six random terms
should not depend on the values of the covariates.
Homoscedasticity can be diagnosed by plotting the residuals against the predicted values. If
the points are scattered randomly, with no pattern in their spread, the assumption holds. (Hox, 2010)
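As a rough numeric companion to the residual plot, the sketch below compares the residual variance across bins of the ordered predicted values. The number of bins is an arbitrary choice, the function name is illustrative, and the data are hypothetical.

```python
from statistics import pvariance

def residual_variance_by_bin(fitted, residuals, n_bins=4):
    """Residual variance within consecutive bins of the ordered fitted values."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    size = len(ordered) // n_bins
    variances = []
    for b in range(n_bins):
        start = b * size
        # The last bin takes any leftover observations.
        end = (b + 1) * size if b < n_bins - 1 else len(ordered)
        variances.append(pvariance(ordered[start:end]))
    return variances

# Hypothetical fitted values and residuals, for illustration only.
fitted = [2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8]
residuals = [0.1, -0.1, 0.1, -0.1, 0.3, -0.3, 0.3, -0.3]
print(residual_variance_by_bin(fitted, residuals, n_bins=2))
```

In this made-up example the variance is larger in the second bin, which is the kind of pattern (spread growing with the predicted value) that would cast doubt on homoscedasticity.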
Some covariates may interact. For example,
there may be a significant interaction between lemma and speaker or between SumAlt
and minimal pairs. A significant interaction between lemma and speaker would
mean that, even for the same lemma, the lemma effect differs from speaker
to speaker after adjusting for the other variables. A likelihood ratio test can
be used to test whether an interaction is significant in a model.
Since minimal pairs is the variable that Dr. Wedel is interested in, it is not necessary to
remove the nonsignificant fixed effect variables. Instead, a likelihood ratio test should be
performed to test whether the variable of primary interest, minimal pairs, is significant in
the model. The model with the minimal pairs term is the full model, and the model
without it is the reduced model. Their -2(log-likelihood) values
can be found in the R output. Under the null hypothesis of no minimal pairs effect, the
difference between the -2(log-likelihood) of the reduced model and that of the full model
approximately follows a chi-square distribution with 1 degree of freedom. If the p-value
from the test is less than 0.05, then the minimal pairs term is significant at the 0.05
significance level; in other words, minimal pairs has a significant effect on vowel distance.
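The arithmetic of the likelihood ratio test can be sketched as follows. The -2(log-likelihood) values are hypothetical, not taken from the R output, and the function name is illustrative. When two models differ in their fixed effects, they are normally fitted by maximum likelihood rather than REML for this comparison to be valid.

```python
import math

def lrt_p_value(neg2ll_reduced, neg2ll_full):
    """Likelihood ratio test with 1 degree of freedom.

    Returns the test statistic and its p-value. For a chi-square variable
    with 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    statistic = neg2ll_reduced - neg2ll_full
    if statistic < 0:
        raise ValueError("the reduced model cannot fit better than the full model")
    return statistic, math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical -2(log-likelihood) values, not taken from the R output.
stat, p = lrt_p_value(neg2ll_reduced=10250.0, neg2ll_full=10243.5)
print(round(stat, 2), p < 0.05)
```

The closed form used here holds only for 1 degree of freedom; testing several terms at once would need the general chi-square distribution.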
2. Examine correlation matrices related to the model and data.
R output for Random Effects
Random effects:
 Groups    Name                Variance   Std.Dev.  Corr
 Lemma     (Intercept)         0.0920629  0.303419
 Speaker   (Intercept)         0.0472761  0.217431
           IMinPairExist2TRUE  0.0174606  0.132138  -0.03
           SumAltTRUE          0.0247814  0.157421  -0.13 -0.94
           c.(IND)             0.0000358  0.005983   0.27  0.44 -0.72
 Residual                      0.4546348  0.674266
Correlation matrix:
                  Intercept   Minimal pairs=T   SumAlt=T   c.(IND)
Intercept              1
Minimal pairs=T       -0.03         1
SumAlt=T              -0.13        -0.94           1
c.(IND)                0.27         0.44          -0.72       1
The correlation matrix that R reports for the random effects gives the
correlation between each pair of speaker-level random coefficients (the random intercept
and the random slopes). For instance, a correlation of -0.03 between the random slope of
minimal pairs and the speaker’s random intercept can be interpreted as follows:
a one standard deviation increase in a speaker’s random intercept is associated, on average,
with a 0.03 standard deviation decrease in that speaker’s minimal pairs random slope.
In other words, if a speaker already pronounces two vowels very differently, whether or
not a minimal pair exists between them will tend to have a slightly weaker effect than if
the speaker pronounced the two vowels very similarly. A correlation this close to zero,
however, indicates that this tendency is very weak.
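The relationship between the reported standard deviations, correlations, and covariances can be sketched as follows. The numbers are the speaker-level values from the R output; the function names are illustrative, and the expected-shift calculation assumes a linear (bivariate normal) relationship between the random effects.

```python
# Speaker-level standard deviations and correlations from the R output,
# in the order: intercept, IMinPairExist2TRUE, SumAltTRUE, c.(IND).
SD = [0.217431, 0.132138, 0.157421, 0.005983]
CORR = [
    [1.00, -0.03, -0.13, 0.27],
    [-0.03, 1.00, -0.94, 0.44],
    [-0.13, -0.94, 1.00, -0.72],
    [0.27, 0.44, -0.72, 1.00],
]

def covariance_matrix(sd, corr):
    """Covariance matrix with entries corr[a][b] * sd[a] * sd[b]."""
    n = len(sd)
    return [[corr[a][b] * sd[a] * sd[b] for b in range(n)] for a in range(n)]

def expected_shift(sd, corr, given, target):
    """Expected change in `target` for a one SD increase in `given`.

    Assumption: a linear (bivariate normal) relationship.
    """
    return corr[target][given] * sd[target]

COV = covariance_matrix(SD, CORR)
print(round(COV[0][0], 4))  # 0.0473, the speaker intercept variance
print(expected_shift(SD, CORR, given=0, target=1))
```

The second printed value is the expected change in the minimal pairs random slope, in Bark distance units, for a speaker whose random intercept is one standard deviation above average: 0.03 standard deviations of the slope, in the negative direction.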
The current study focuses on the magnitude of the Euclidean distance between vowel
pairs. However, this method does not consider situations where two vowel pairs have the
same distance but different orientations. Future studies may benefit from including the
orientation of the vowels alongside the distances, in which case a two-dimensional vector
could be introduced into the model to represent the orientation.
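The point about orientation can be made concrete with a small sketch. The coordinates are hypothetical positions on a two-dimensional formant plot, and the function name is illustrative; this is one possible way to represent orientation, not a method from the study.

```python
import math

def displacement(vowel_a, vowel_b):
    """Return (distance, angle in degrees) of the vector from vowel_a to vowel_b."""
    dx = vowel_b[0] - vowel_a[0]
    dy = vowel_b[1] - vowel_a[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

# Two hypothetical vowel pairs with the same distance but different orientations.
dist_1, angle_1 = displacement((0.0, 0.0), (3.0, 4.0))
dist_2, angle_2 = displacement((0.0, 0.0), (4.0, 3.0))
print(dist_1 == dist_2, round(angle_1, 1), round(angle_2, 1))
```

A model of distance alone would treat these two pairs as identical, whereas the angle (or the vector itself) distinguishes them.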