This time: A correction, ANOVA with review, student reviews.

Assignment 4 is marked and in the workshop.

Correction from Monday’s lecture. A more detailed account of why this is so is in Wk13_Extra on the webpage, but it’s for interest only.

What I said: The proportion of variation explained is just like the r-squared from correlation. In fact, if I took the correlation of marriage number and marriage length and squared it, I should get the same proportion.

What I neglected to say: I will get the same amount of variance that is explained BY THE REGRESSION.

Regression and ANOVA are two different ways to model the data, so depending on the data one can explain a lot more of the variance than the other. If the independent variable is nominal (groups), ANOVA will do much better. It will explain more variance than a regression. If the independent variable is interval (X), regression will do better.

In the caffeine.sav data, ANOVA beats regression hands down.
ANOVA r-squared: 235/281 = 0.836, 83.6% variance explained
Regression r-squared: 16/281 = 0.057, 5.7% variance explained

But in the marriage data, number of previous marriages could be interpreted as interval. This fact, and the linear decrease in marriage length as the number of previous marriages increases, means that regression is ALSO a useful tool to analyze the data. In short, both ANOVA and regression were appropriate for the marriage data. We would expect to get similar results from both.
Regression explains 80.3%
ANOVA explains 80.6%

Too long, didn’t read: When I said the proportion of variance explained is exactly r-squared…
Proportion of variance explained in a regression is the r-squared for regression.
Proportion of variance explained in ANOVA is the r-squared for ANOVA.
You can compare these to help determine which is better. That they matched for the marriage example was a freak event.

Thanks for bearing with me sandwiching that into the lecture.
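To make the two r-squared values concrete, here is a minimal sketch that computes both on the same data. The numbers are made up for illustration (they are not from caffeine.sav or the marriage data), and the function names are hypothetical helpers, not anything from SPSS.

```python
# Hypothetical data: three groups coded 0, 1, 2 (made-up numbers).
groups = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
scores = [3, 4, 5, 4, 8, 9, 7, 8, 4, 3, 5, 4]


def mean(values):
    return sum(values) / len(values)


def anova_r_squared(x, y):
    """Between-group sum of squares divided by total sum of squares."""
    grand = mean(y)
    ss_total = sum((v - grand) ** 2 for v in y)
    ss_between = 0.0
    for level in set(x):
        members = [v for g, v in zip(x, y) if g == level]
        ss_between += len(members) * (mean(members) - grand) ** 2
    return ss_between / ss_total


def regression_r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


r2_anova = anova_r_squared(groups, scores)
r2_regression = regression_r_squared(groups, scores)
print(f"ANOVA r-squared:      {r2_anova:.3f}")
print(f"Regression r-squared: {r2_regression:.3f}")
```

In this made-up data the middle group is high and the two outer groups are low, so a straight line through the group codes explains almost nothing while the group means explain most of the variation.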
(No dragons harmed in this DLT)

New example: Consider the dataset Ch8_24.sav, which has a list of patients with a panic disorder including
- The type of treatment they are receiving (Behavioural, Cognitive, or Medication).
- The response to the treatment, from 0 (no help) to 10 (helping completely).

Tossing ethics aside, let’s assume that the assignment of therapy to person was random. For cost reasons, the medication group is larger than the other two. This is called an ____________ design.

The response/dependent variable is ordinal (0-10 scale), but we’ll treat it like it’s interval because the points on the scale could be assumed to be evenly spaced apart.

The explanatory/independent variable is nominal (type of treatment). It has no natural ordering, so we can only treat it as nominal.

What should we do?

Our toolbox:
- Normal z-test: One group only, needs known σ
- One sample t-test: One group only
- Two sample t-test: Two groups only
- Correlation: Needs interval explanatory
- Regression: Needs interval explanatory
- Chi-Squared: Needs nominal response
- Odds Ratio: Needs nominal response
- ANOVA: This one

There is one more requirement of ANOVA, _________ standard deviation. We can check this subjectively by looking at a scatterplot. The groups appear to be spread about the same amount, so the assumption that the standard deviations are the same is reasonable.
However, since all the values are whole numbers, the scatterplot can be hiding something: multiple cases with the same value. There’s no way to tell how many cases each of these dots represents. There could be any number of cases with a response of “5”.

Option one: Another visualization. Does anyone else remember the __________________?

The boxplot gives us a picture of a measure of spread, the _______________, the range between Q1 and Q3. Each box spans the middle half of the data, and no box is much larger than any other. The boxes are close to the same height; also, none of the categories have tons of outliers. So there’s little evidence that the true standard deviations are different.

Side-by-side boxplots give information that scatterplots can’t. They’re also very useful when the groups have _________.

Option two: Look at the sample standard deviation. We can find it and other info in the summary statistics. All the standard deviations are between 1.68 and 1.81; again, no evidence of heteroscedasticity.

Let’s actually do the ANOVA, with a default alpha of 0.05.

Is there a significant difference between the means? No. Sig., the p-value, is .204. P-value > alpha, so we fail to reject the null (that all the true means are the same). The sample means aren’t different enough to say that the population means are different.

How much of the variation in response to therapy is explained by the type of treatment? Only 10.426 / 84.074 = 0.124, or 12.4% of the variation is explained. Knowing the group would help very little, if at all, in predicting the response to treatment.

Some ending notes about this problem:

Having one group larger than the others (13 cases for medication, 7 cases each in the behavioural and cognitive groups) didn’t cause any problems. Like t-tests and chi-squared tests, really small groups have their own issues, but just because they’re small, not because they’re a different size than some other group.
Only having groups with different amounts of variation, or oddities like outliers, is a problem (we’ll see one on Friday).

Early ending for student evaluations.

Next time: ANOVA examples, course wrap-up.

For reference (not on final):

Boxplots made by: Graphs > Legacy Dialogs > Boxplots > Summaries for groups of cases.

Summary stats made by: Analyze > Descriptive Stats > Explore
- Put “Response to Therapy” in Dependent List
- Put “Name of Therapy” in Factor List
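Also for reference: the numbers quoted for the panic-disorder ANOVA can be cross-checked by hand. The sketch below uses only the two sums of squares and the group sizes reported in this lecture. The variable names are my own, and the p-value line relies on a closed-form shortcut that is valid only when there are exactly three groups (two numerator degrees of freedom). It is a rough check, not a replacement for the SPSS output.

```python
# Values reported in the lecture for the panic-disorder ANOVA.
ss_between = 10.426        # variation explained by type of treatment
ss_total = 84.074          # total variation in response
group_sizes = [13, 7, 7]   # medication, behavioural, cognitive

n = sum(group_sizes)
k = len(group_sizes)
df_between = k - 1
df_within = n - k

# Proportion of variance explained (the ANOVA r-squared).
r_squared = ss_between / ss_total

# F statistic: mean square between groups over mean square within groups.
ss_within = ss_total - ss_between
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Upper-tail probability of the F distribution.
# This shortcut holds ONLY for df_between == 2 (three groups).
if df_between != 2:
    raise ValueError("closed-form p-value below assumes exactly three groups")
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(f"r-squared = {r_squared:.3f}")
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
print(f"p-value = {p_value:.3f}")
```

For designs with a different number of groups, a statistics library or an F table would be needed for the p-value step.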