LINEAR MIXED MODEL FOR VOWEL DISTANCES

Ziyue Jia, Nicholas Lytal, Min Zhu
12th October 2015

1 EXECUTIVE SUMMARY

Andrew Wedel, PhD, Linguistics, consulted StatLab to evaluate his linear mixed effects model for vowel distance. He wished to confirm whether his model was satisfactory and to clarify some of the steps involved.

2 DATA COLLECTION

The data Dr. Wedel provided came from a freely available data set that was not designed specifically for this study. The sample consists of 40 working-class individuals from Ohio who were asked to talk about local affairs for one hour each in order to record natural conversation. Directly asking individuals to compare vowel sounds tends to lead subjects to speak unnaturally, which is undesirable and could introduce bias.

3 BACKGROUND

In the English language, many spoken words have minimal pairs (words that differ slightly by their vowel sound), such as “cot” and “caught.” Such words may be pronounced identically in some dialects and differently in others. In addition, some dialects are more careful in distinguishing minimal pairs in their pronunciation.

Dr. Wedel’s hypothesis is that when speakers pronounce words that have a minimal pair, they differentiate the vowel sounds further than they would for words without a minimal pair. For instance, “jet” has no minimal pair with the “ɪ” vowel (there is no word “jit”), but “bet” does, with “bit.” Therefore, the “ε” and “ɪ” vowels in “bet” and “bit” should be pronounced more differently (and further apart) than “ε” and “ɪ” in unrelated words such as “set” and “mitt”.

All analysis was completed with R.

4 METHODS USED

Before our meeting, Dr. Wedel fitted a linear mixed effects model to his data. The model combines several of the most important factors in vowel separation with other covariates that may affect speech according to the literature and common practice in linguistics.
Since the existence of minimal pairs may vary by speaker, the transcriber determined whether each subject made these distinctions. Dr. Wedel decided on a first model that includes the following important factors as random effects:

Lemma: The words used by the speakers in their conversation. Different words in the same vowel comparison may have different effects. (Categorical)
IMinPairExist2: Whether or not a minimal pair exists between the token vowel and the specific vowel being compared. (Binary)
SumAlt: Whether or not a minimal pair exists for the token vowel OTHER than the one being compared. (Binary)
IND: The neighborhood index, or the number of words very similar to the word in question (for instance, pet, pend, and peck are all part of the neighborhood index for pen). (Discrete)

Also included in the model are the following variables:

sILemFreq: A measure of how commonly the word (lemma) was spoken. (Continuous)
SIBiphone: A word-level variable related to the biphones of that word. (Continuous)
Gender: Gender of the speaker. (Binary)
Vow_CompVow: Vowels being compared. (Categorical)
sVowelLength: The perceived duration of a vowel sound. (Continuous)
NewSpeechRate: The speech rate for each speaker. (Continuous)
sForwardBigram: (Continuous)
sBackwardBigram: (Continuous)
Age: Whether the speaker is older than 40 or younger than 40. (Y=Above 40, O=Below 40) (Binary)
Prevmen: (Binary)
Cat: Category of lemma (J: Adjective, N: Noun, R: Adverb, V: Verb). (Categorical)

By convention in linguistic studies, all discrete and continuous variables are centered (the variable's mean is subtracted from each value) and are shown as ‘c.(Variable)’ in R.

After obtaining the R output for the first model, Dr. Wedel removed the fixed effect variables that were not significant (p>0.05): sVowelLength, NewSpeechRate, sForwardBigram, sBackwardBigram, Age, Prevmen, and Cat. Dr. Wedel then refitted the model with the significant fixed effect variables and all random effect variables.
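The centering convention described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the analysis itself was done in R), and the function name `center` and the neighborhood index values are hypothetical, not taken from the study data.

```python
from statistics import mean


def center(values):
    """Return a copy of `values` with the sample mean subtracted from each entry."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical neighborhood index (IND) values for four tokens; not study data.
ind = [12.0, 7.0, 20.0, 9.0]
c_ind = center(ind)

print(c_ind)       # centered values, analogous to c.(IND)
print(sum(c_ind))  # a centered variable sums to zero (up to rounding)
```

After centering, the intercept of the model corresponds to the expected response at the average value of each centered covariate, which is one common reason for adopting the convention.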
Vowels were mapped by sound frequency (Hz) to a two-dimensional formant plot (see example next page). This produces areas of vowel density that may be mapped against one another to produce a Euclidean distance. Each vowel, such as “ε” or “ɪ”, was measured against its three nearest neighbor vowels according to the literature. Since some words, vowels, and vowel pairs are more common than others, each vowel was mapped to one of its three nearest neighbors in such a way that one third of all measurements were made to each of the three closest neighbors. (http://www.ling.upenn.edu/courses/cogs501/Hillenbrand.html)

5 RESULTS AND DISCUSSION

Dr. Wedel had a few specific requests:

1. Analyze model and determine whether structure is sound and results are meaningful.

R output for the first model:

Model:

\[
\mathrm{BarkEuclidDist}_{ijk} = \beta_{00} + \beta_{10}\,\mathrm{MinPairExist2}_{i} + \beta_{20}\,\mathrm{SumAlt}_{i} + \beta_{30}\,\mathrm{c.(IND)}_{i} + \gamma_{1i}\,\mathrm{MinPairExist2}_{i} + \gamma_{2i}\,\mathrm{SumAlt}_{i} + \gamma_{3i}\,\mathrm{c.(IND)}_{i} + \mathrm{fix\_effect} + \gamma_{0i} + \tau_{0j} + \varepsilon_{ijk}
\]

\[
\gamma_{0i} \sim N(0, \sigma^2_{\gamma 0}),\quad
\gamma_{1i} \sim N(0, \sigma^2_{\gamma 1}),\quad
\gamma_{2i} \sim N(0, \sigma^2_{\gamma 2}),\quad
\gamma_{3i} \sim N(0, \sigma^2_{\gamma 3}),\quad
\tau_{0j} \sim N(0, \sigma^2_{\tau 0}),\quad
\varepsilon_{ijk} \sim N(0, \sigma^2)
\]

β00 is the intercept. β̂00 = 2.360
β10 is the coefficient of the fixed slope of MinPair=T. β̂10 = 0.185
β20 is the coefficient of the fixed slope of SumAlt=T. β̂20 = -0.207
β30 is the coefficient of the fixed slope of ‘IND’. β̂30 = 0.000115

i = 1, …, number of speakers: γ0i is the random intercept of speaker i. γ̂0i ~ N(0, 0.0473)
j = 1, …, number of lemmas: τ0j is the random intercept of lemma j. τ̂0j ~ N(0, 0.0921)
γ1i is the random slope of MinPair=T for speaker i. γ̂1i ~ N(0, 0.0175)
γ2i is the random slope of SumAlt=T for speaker i. γ̂2i ~ N(0, 0.0248)
γ3i is the random slope of ‘IND’ for speaker i. γ̂3i ~ N(0, 0.0000358)
εijk ~ N(0, 0.455)

The fix_effect term stands for the other fixed effect variables, which will not be elaborated in this report.

After removing insignificant (p>0.05) fixed effect variables:

R output:

After fitting the model, the assumptions of the linear mixed model should be checked.
Normality: All 6 random terms (γ1i, γ2i, γ3i, γ0i, τ0j, εijk) should be normally distributed with mean 0. Normality can be diagnosed by plotting standardized residuals against their normal scores. The points should follow a straight diagonal line if the residuals are normally distributed. (Hox, 2010)

Homoscedasticity (constant variance): The variance of each of these 6 random terms should not depend on the corresponding values of the covariates. Homoscedasticity can be diagnosed by plotting residuals against their predicted values. If the points are randomly scattered with no visible pattern, the homoscedasticity assumption is reasonable. (Hox, 2010)

It is possible that there are interactions between some covariates. For example, there may be a significant interaction between lemma and speaker, or between SumAlt and minimal pairs. A significant interaction between lemma and speaker would mean that, even within the same lemma, the variance of the lemma effect could differ from speaker to speaker after adjusting for the other variables. A likelihood ratio test can be used to test whether an interaction is significant in a model.

Since minimal pairs is the variable that Dr. Wedel is interested in, it is not necessary to remove the insignificant fixed effect variables. Instead, a likelihood ratio test can be performed to test whether the variable of primary interest, minimal pairs, is significant in the model. The model with the minimal pairs term is the full model, and the model without the minimal pairs term is the reduced model. Their -2(log-likelihood) values can be found in the R output. Under the null hypothesis that the minimal pairs term has no effect, the difference between the -2(log-likelihood) of the reduced model and the -2(log-likelihood) of the full model approximately follows a chi-square distribution with 1 degree of freedom. For example, if the p-value from the test is less than 0.05, then the minimal pairs term is significant at the 0.05 significance level. In other words, minimal pairs has a significant effect on the vowel distance at the 0.05 significance level.

2. Examine correlation matrices related to the model and data.

R output for Random Effects

Random effects:
 Groups    Name                Variance   Std.Dev.  Corr
 Lemma     (Intercept)         0.0920629  0.303419
 Speaker   (Intercept)         0.0472761  0.217431
           IMinPairExist2TRUE  0.0174606  0.132138  -0.03
           SumAltTRUE          0.0247814  0.157421  -0.13  -0.94
           c.(IND)             0.0000358  0.005983   0.27   0.44  -0.72
 Residual                      0.4546348  0.674266

Correlation matrix:

                   Intercept  Minimal pairs=T  SumAlt=T  c.(IND)
 Intercept         1          -0.03            -0.13      0.27
 Minimal pairs=T              1                -0.94      0.44
 SumAlt=T                                      1         -0.72
 c.(IND)                                                  1

The correlation matrix that R produces for the random effects gives the correlation between any two of the speaker-level random coefficients, that is, the random intercept and the random slopes for speakers. For instance, the correlation of -0.03 between the random slope of minimal pairs and the speaker’s random intercept can be interpreted as follows: an increase of one standard deviation in a speaker’s random intercept is associated, on average, with a decrease of 0.03 standard deviations in that speaker’s random slope for minimal pairs. In other words, if a speaker already pronounces two vowels very differently, whether or not a minimal pair exists between them will not have as strong an effect as if the speaker pronounced the two vowels very similarly, although a correlation this close to zero indicates that the tendency is very weak.

The current study focuses on the magnitude of the Euclidean distance between vowel pairs. However, this method does not consider situations where two vowel pairs have the same distance but different orientations. Future studies may benefit from including the orientation of the vowels alongside the distances, in which case a two-dimensional vector could be introduced into the model to represent the orientation.
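To make the point about orientation concrete, the following sketch computes both the Euclidean distance and the direction of the displacement between two vowels in a two-dimensional formant space. It is written in Python for illustration only, the helper `distance_and_angle` is hypothetical, and the coordinates are invented values rather than measurements from the study.

```python
import math


def distance_and_angle(v1, v2):
    """Return (Euclidean distance, orientation in degrees) of the vector from v1 to v2."""
    dx = v2[0] - v1[0]
    dy = v2[1] - v1[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


# Hypothetical two-dimensional vowel coordinates; not data from the study.
pair_a = ((3.0, 10.0), (6.0, 14.0))  # displacement (3, 4)
pair_b = ((3.0, 10.0), (7.0, 13.0))  # displacement (4, 3)

dist_a, angle_a = distance_and_angle(*pair_a)
dist_b, angle_b = distance_and_angle(*pair_b)

print(dist_a, dist_b)    # the two pairs are equally far apart
print(angle_a, angle_b)  # but the displacements point in different directions
```

A model whose response is the distance alone treats these two pairs as identical, whereas a two-dimensional response (for example, the displacement components) would distinguish them.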
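The likelihood ratio test proposed under request 1 can also be sketched numerically. The Python snippet below is an illustration only: the function name `lrt_pvalue_1df` and the -2(log-likelihood) values are hypothetical, and it assumes that both values come from maximum-likelihood fits of nested models that differ by a single parameter. For 1 degree of freedom, the chi-square upper tail probability equals erfc(sqrt(x / 2)), so no statistics library is needed.

```python
import math


def lrt_pvalue_1df(neg2ll_reduced, neg2ll_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (test statistic, p-value) using the chi-square distribution
    with 1 degree of freedom.
    """
    stat = neg2ll_reduced - neg2ll_full
    if stat < 0:
        raise ValueError("the reduced model cannot fit better than the full model")
    # For chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical -2(log-likelihood) values; not taken from the R output.
stat, p = lrt_pvalue_1df(neg2ll_reduced=1005.84, neg2ll_full=1002.00)
print(stat, p)
```

If the resulting p-value is below 0.05, the term dropped from the reduced model would be judged significant at the 0.05 level, following the decision rule described in the report.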