Turner Answer Key for Chapter Eleven 5 2015
Using statistics in small-scale language education research: Focus on non-parametric data

Chapter Eleven Practice Problems Answer Key

1. I chose to find the correlation between students’ performance on the oral debate and on their written report, because I believe the written report is closely related to the topic of the debate.

Research question: Is there a statistically significant relationship between a learner’s performance in a debate and the quality of the individual’s written work when writing on the debate topic?

I’ve assigned the role of independent variable to performance on the debate and the role of dependent variable to quality of the written report. Now I follow the steps in statistical logic.

Step 1. State hypotheses
H0: There is no statistically significant correlation between a learner’s performance in an oral debate and the quality of the individual’s written report on the debate topic.
H1: There is a statistically significant correlation between a learner’s performance in an oral debate and the quality of the individual’s written report on the debate topic.

Step 2. Set alpha
alpha = .01

Step 3. Identify the appropriate statistic for the analysis
I propose to analyze the data using Spearman rho or Kendall’s tau because:
1) the independent variable data is collected using a tool that yields rankable ordinal data;
2) the dependent variable data is collected using a tool that yields rankable ordinal data;
3) each observation is independent of the others;
4) if there are tied scores within a variable, I will use Kendall’s tau.

Step 4. Collect the data
The table presents all three variables; in this example I find the correlation between performance in the debate and performance on the written report.

Data table

Participant   debate   brief   report
 1            99       94      98
 2            78       80      86
 3            88       92      93
 4            97       96      95
 5            93       94      94
 6            94       92      98
 7            91       90      90
 8            95       94      94
 9            98       96      95
10            79       83      84
11            85       83      89
12            94       92      95
13            95       92      95
14            94       96      95
15            90       94      92
16            92       94      87
17            94       92      93
18            94       90      92
19            89       76      77
20            97       90      89
21            94       90      89
22            88       90      89

Step 5. Check the assumptions
I proposed to analyze the data using Spearman rho or Kendall’s tau. These are the assumptions I need to check, and I also need to examine the data to see if there are tied scores.
1) the independent variable data is collected using a tool that yields rankable ordinal data;
2) the dependent variable data is collected using a tool that yields rankable ordinal data;
3) each observation is independent of the others.

From examining the data in the table, I see that the outcomes are rankable, so assumptions 1 and 2 are met. The debate and the written report are separate assignments, and the scoring rubrics for the debate and the written report are entirely distinct, so the third assumption is met, too. There are some tied scores within each of the two variables, so I’ll use Kendall’s tau.

Step 6. Calculate the observed value of the statistic
I calculated and report descriptive statistics and, just for fun, calculated the Shapiro-Wilk statistic for each variable, made histograms for each, and made a scatterplot before calculating the observed value of Kendall’s tau. The R commands I used are presented below.

> data = read.csv(file.choose(), header = T)
> View(data)
> summary(data$debate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  78.00   89.25   94.00   91.73   94.75   99.00
> summary(data$report)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  77.00   89.00   92.50   91.32   95.00   98.00
> sd(data$debate)
[1] 5.504819
> sd(data$report)
[1] 4.912437
> par(mfrow = c(1,2))
> hist(data$debate, col = "deep sky blue", breaks = 10)
> hist(data$report, col = "dark salmon", breaks = 10)
> shapiro.test(data$debate)

        Shapiro-Wilk normality test

data:  data$debate
W = 0.8778, p-value = 0.01095

> shapiro.test(data$report)

        Shapiro-Wilk normality test

data:  data$report
W = 0.9064, p-value = 0.03991

> plot(data$debate, data$report)
> cor.test(data$debate, data$report, method = "kendall", exact = F)

        Kendall's rank correlation tau

data:  data$debate and data$report
z = 3.4082, p-value = 0.0006538
alternative hypothesis: true tau is not equal to 0
sample estimates:
    tau
0.55661

Step 7. Calculate the exact probability of the statistic
I simply retrieve the exact probability from the R output, so exact p = 0.0006538.

Step 8. Compare the exact probability to alpha
The rules for interpreting exact probability are:
If exact probability ≥ alpha → accept the null hypothesis
If exact probability < alpha → reject the null hypothesis
The exact probability, p = 0.0006538, is less than alpha, .01, so I reject the null hypothesis and accept the alternative hypothesis.
H1: There is a statistically significant correlation between a learner’s performance in an oral debate and the quality of the individual’s written report on the debate topic.

Step 9. Make the probability statement
We can be 99% certain that there is a statistically significant correlation between a learner’s performance in an oral debate and the quality of the individual’s written report on the debate topic.

Step 10. Interpret the meaningfulness
There are two avenues for interpreting meaningfulness: 1) with reference to the research question, and 2) by calculating effect size. We discovered that there is a statistically significant relationship between learners’ performance in a debate and the quality of their written work when writing on the debate topic (Kendall’s tau = 0.55661; p = 0.0006538). Effect size is not typically calculated for Kendall’s tau, though Anglim (2012, cited in Turner, 2014) indicates that the squared value of Kendall’s tau can be reported and interpreted as shared variance (shared variance = .31).

2. Derek Yiu (2011) agreed to share his data from his research project. Here’s part of his abstract.

Are you a blabbermouth? A mixed-method study of personality and oral classroom participation
Ho Yin Yiu (Derek)
Monterey Institute of International Studies

Abstract
In light of recent research on the role of personality in language learning, this study investigates the constructs of extroversion and introversion and their relationship to participation in the language classroom. Specifically, I explore the relationship between students’ self-perception of their own introversion and extroversion, and their self-reported oral classroom participation. Quantitative data were collected by means of a questionnaire. Participants included 42 native speakers of English who have studied, or were studying, a foreign language.

For the explanation below I imported the dataset from the Companion Website. I give a summary of the R commands I used in Step 6 below.

Step 1. State hypotheses
H0: There is no statistically significant correlation between language learners’ perception of their degree of introversion/extroversion and their self-reported amount of participation in language classes.
H1: There is a statistically significant correlation between language learners’ perception of their degree of introversion/extroversion and their self-reported amount of participation in language classes.

Step 2. Set alpha
alpha = .05, because the research is exploratory.

Step 3. Identify the appropriate statistic for the analysis
I propose to analyze the data using one of the correlation statistics: Pearson’s r, Spearman rho, or Kendall’s tau. Derek’s data collection tool may yield normally distributed data; if the data are normally distributed, I’ll use Pearson’s r. If they are not normally distributed, I’ll use Spearman rho or Kendall’s tau, depending on whether there are tied scores within the data.
1) the independent variable data is collected using a tool that yields data which may be normally distributed;
2) the dependent variable data is collected using a tool that yields data which may be normally distributed;
3) each observation is independent of the others;
4) the relationship between the two variables is linear;
5) if there are tied scores within a variable, I will use Kendall’s tau.

Step 4. Collect the data
The data can be retrieved from the Companion Website or entered directly into R from the table presented in the problem. If you import the dataset, note that the column with the header vert includes the introversion-extroversion scores and the column with the header part includes the self-reported participation scores.

Step 5. Check the assumptions
1) the independent variable data is collected using a tool that yields data which may be normally distributed;
2) the dependent variable data is collected using a tool that yields data which may be normally distributed;
3) each observation is independent of the others;
4) the relationship between the two variables is linear.
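
The decision rule described in Step 3 can be sketched in R as follows. This is only an illustration under stated assumptions: choose_correlation() is a hypothetical helper (it is not a base R function and does not come from the chapter), and the .05 cutoff for the Shapiro-Wilk p-value is an assumption of the sketch. To have real numbers to run it on, the sketch uses the debate and report scores from the Problem 1 data table.

```r
# choose_correlation() is a hypothetical helper written for this sketch;
# it is not part of base R or of the chapter.
choose_correlation <- function(x, y, normality_alpha = 0.05) {
  # Shapiro-Wilk p-values at or above the cutoff are treated as
  # "approximately normal". The .05 cutoff is an assumption of this sketch.
  normal_x <- shapiro.test(x)$p.value >= normality_alpha
  normal_y <- shapiro.test(y)$p.value >= normality_alpha
  if (normal_x && normal_y) {
    return("pearson")
  }
  # anyDuplicated() returns 0 when no value is repeated within a vector.
  has_ties <- anyDuplicated(x) > 0 || anyDuplicated(y) > 0
  if (has_ties) "kendall" else "spearman"
}

# Debate and report scores from the Problem 1 data table.
debate <- c(99, 78, 88, 97, 93, 94, 91, 95, 98, 79, 85,
            94, 95, 94, 90, 92, 94, 94, 89, 97, 94, 88)
report <- c(98, 86, 93, 95, 94, 98, 90, 94, 95, 84, 89,
            95, 95, 95, 92, 87, 93, 92, 77, 89, 89, 89)

method <- choose_correlation(debate, report)
print(method)

# Run the selected correlation; exact = FALSE matches the Problem 1 call.
result <- cor.test(debate, report, method = method, exact = FALSE)
print(result$estimate)
```

For the Problem 1 scores the helper selects "kendall", which agrees with the choice made in that problem. The helper does not check independence or linearity; those still depend on the reasoning about the data collection tools and on the scatterplot.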

The histograms and the outcomes of the Shapiro-Wilk analyses indicate that the data for each variable approximate a normal distribution, so assumptions 1 and 2 are met (the R commands are presented in Step 6 below). The two tools used to collect the data (the introversion/extroversion survey and the self-reported participation survey) are completely distinct from one another, so the 3rd assumption is met. The scatterplot shows that the relationship between the two variables is linear (see the scatterplot produced by the plot command in Step 6), so the 4th assumption is met, too.

Step 6. Calculate the observed value of the statistic
The R commands I used to calculate the descriptive statistics, make histograms, make a scatterplot, and calculate the correlation are presented below.

> derek.data = read.csv(file.choose(), header = T)
> View(derek.data)
> summary(derek.data$vert)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   4.215   5.335   5.423   6.453   8.070
> summary(derek.data$part)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.33    5.00    5.83    5.73    6.50    9.00
> sd(derek.data$vert)
[1] 1.390546
> sd(derek.data$part)
[1] 1.567391
> par(mfrow = c(1,2))
> hist(derek.data$vert, col = "medium spring green", breaks = 10)
> hist(derek.data$part, col = "light slate blue", breaks = 10)
> shapiro.test(derek.data$vert)

        Shapiro-Wilk normality test

data:  derek.data$vert
W = 0.9644, p-value = 0.2121

> shapiro.test(derek.data$part)

        Shapiro-Wilk normality test

data:  derek.data$part
W = 0.9702, p-value = 0.3345

> plot(derek.data$vert, derek.data$part)
> cor.test(derek.data$vert, derek.data$part)

        Pearson's product-moment correlation

data:  derek.data$vert and derek.data$part
t = 21.7879, df = 40, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9270025 0.9786407
sample estimates:
      cor
0.9603577

Step 7. Calculate the exact probability of the statistic
I simply retrieve the exact probability from the R output: p-value < 2.2e-16 (that is, p is less than .00000000000000022).

Step 8. Compare the exact probability to alpha
The rules for interpreting exact probability are:
If exact probability ≥ alpha → accept the null hypothesis
If exact probability < alpha → reject the null hypothesis
The exact probability, p < .00000000000000022, is less than alpha, .05, so I reject the null hypothesis and accept the alternative hypothesis.
H1: There is a statistically significant correlation between language learners’ perception of their degree of introversion/extroversion and their self-reported amount of participation in language classes.

Step 9. Make the probability statement
We can be 95% certain that there is a statistically significant correlation between language learners’ perception of their degree of introversion/extroversion and their self-reported amount of participation in language classes.

Step 10. Interpret the meaningfulness
There are two avenues for interpreting meaningfulness: 1) with reference to the research question, and 2) by calculating effect size. We can be 95% certain that there is a statistically significant correlation between language learners’ perception of their degree of introversion/extroversion and their self-reported amount of participation in language classes (Pearson’s r = .96; p < .00000000000000022). The effect size is strong (r² = .92).

Because the null hypothesis was rejected, one can estimate (or predict) someone’s self-reported level of class participation based on that person’s self-reported introversion/extroversion. The following discussion explains how to calculate the estimated or predicted y using R.

The predictor variable is the x-variable; the variable to be predicted is the y-variable. The order in which you place the variables in an R command is important! In this example, the dependent variable (y) is self-reported participation. The independent variable (x) is introversion/extroversion. The lm command derives from the idea of “linear modeling”.

Given that there’s a statistically significant linear relationship between introversion/extroversion and self-reported participation in class, you can estimate what a person’s level of participation would be given that individual’s reported introversion/extroversion. You can see this in the scatterplot (below): someone who has an introversion/extroversion score of about 5.25 is likely to have a participation score between about 4.75 and about 6. Linear modeling allows calculation of the “estimated” or predicted value of Y, which is represented as $\hat{Y}$.
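
As a minimal sketch of what estimating y from x looks like in R, the block below fits a line with lm and asks for the estimate at x = 5.25 with predict(). It swaps in predict() as an alternative to working from the coefficients by hand. The scores in the toy data frame are hypothetical values made up for illustration; they are not Derek’s data, so the number printed will not match the estimate in this answer key.

```r
# Hypothetical toy scores for illustration only; these are NOT Derek's data.
toy <- data.frame(
  vert = c(3.0, 4.0, 5.0, 6.0, 7.0, 8.0),  # introversion/extroversion (x)
  part = c(3.2, 4.1, 5.4, 6.2, 7.6, 8.4)   # self-reported participation (y)
)

# The dependent variable (y) goes to the left of ~, the predictor (x) to the right.
fit <- lm(part ~ vert, data = toy)

# Estimated participation for a person whose introversion/extroversion score is 5.25.
new_person <- data.frame(vert = 5.25)
y_hat <- predict(fit, newdata = new_person)
print(y_hat)
```

Writing the formula as part ~ vert with a data argument, rather than with derek.data$part ~ derek.data$vert, is what lets predict() match the vert column in newdata to the predictor.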
Calculating the predicted value is more precise than “eye-balling” the scatterplot. The formula for $\hat{Y}$ is

$\hat{Y} = \bar{Y} + b(X - \bar{X})$,

where b is the slope of the line. R uses a different formula, though both the formula for calculating $\hat{Y}$ with a calculator and the formula R uses are derived from the same principle.

The commands for doing linear regression are presented below, each followed by its explanation.

> lm(derek.data$part ~ derek.data$vert)
The lm command indicates that I want to predict a value for y (participation) from an individual’s performance on x (introversion/extroversion).

> res = lm(derek.data$part ~ derek.data$vert)
Saving the model in an object named res allows retrieval of the results. So this command “says” save the results of doing linear modeling of y on x.

> res
Typing ‘res’ gives you the results of the linear modeling, two coefficients: beta 1 (the intercept) and beta 2 (the slope).

Output
Call:
lm(formula = derek.data$part ~ derek.data$vert)

Coefficients:
    (Intercept)  derek.data$vert
         -0.141            1.082

> plot(derek.data$vert, derek.data$part)
Make a scatterplot again.

> abline(res)
Then make a line from ‘a’ to ‘b’ using the abline command. The output looks like this:

[Figure: scatterplot of vert against part with the fitted regression line]

> betas = coef(res)
This command recalls the “betas”, which are needed in the formula R uses to estimate (predict) y. Typing “betas” presents the values.

> betas
    (Intercept) derek.data$vert
     -0.1409579       1.0824929

> sum(betas * c(1, 5.25))
[1] 5.54213
This command calculates the estimated value of y for x = 5.25. (The formula says: Get the beta values. Then multiply beta 1 (the intercept) by 1 and multiply beta 2 (the slope) by the value of x that we want to predict from (5.25), and then add these two products.) The sum is the estimated (predicted) level of self-reported participation for a person who has a self-reported introversion/extroversion score of 5.25.

To calculate a confidence interval for $\hat{Y}$, you can follow the description in the chapter for calculating the standard error of the estimate, SEE. The standard deviation for y, participation, is 1.567391 and the mean is 5.73. The correlation is .96. The formula for SEE is:

$SEE = S_y\sqrt{1 - r_{xy}^2} = 1.567391\sqrt{1 - .96^2} = 1.567391\sqrt{1 - .92} = 1.567391\sqrt{.08} = (1.567391)(.2828) = .44$ points

To determine the 68% confidence band for $\hat{Y}$ (which is 5.54213), find the low end of the confidence band by subtracting SEE from $\hat{Y}$ and find the high end of the confidence band by adding SEE to $\hat{Y}$. [($\hat{Y}$ - SEE) for the lower boundary and ($\hat{Y}$ + SEE) for the upper boundary.] Here, 5.54 - .44 = 5.10 and 5.54 + .44 = 5.98.

I can be 68% confident that a person whose self-reported level of introversion/extroversion is 5.25 will have a self-reported level of participation between 5.10 and 5.98.

Bibliography

Yiu, D. (2011). Are you a blabbermouth? A mixed-method study of personality and oral classroom participation. (Unpublished paper, Graduate School of Translation, Interpretation, and Language Education, Monterey Institute of International Studies, Monterey, CA.)