Problem 10.19 Problem 10.21 Using correlation to see if paired Begin Chapter 11: The regression equation.

Problem 10.19: Do reading and television viewing compete for leisure time? We have a random sample of 10 children with…

X: Books read last year        0  7  2  1  5  4  3  3  0  1
Y: Hours TV watched per day    3  1  2  2  0  1  3  2  7  4

First, what are we interested in finding? (X is books/year, Y is TV/day)
A) Is the mean of X less than a specific value?
B) Is the mean of X less than the mean of Y?
C) Does X decrease as Y increases?
D) Is the proportion of X more than a specific value?

So let’s do a correlation. We’re trying to find out if book reading decreases as television watching increases. In other words, we’re looking for a __________ correlation. That implies that any test for significance will be ______ tailed, also called _______ sided.

In SPSS: Analyze > Correlate > Bivariate

Protip: You can set the test to one-tailed in the pop-up window where you select variables. This just cuts the p-value in half for you.

r = -.725, which is strong and negative. p-value = .009 (one-tailed)

We could reject the null hypothesis that books read and TV watched are uncorrelated, but... are we done?

No, we should also check the ___________ and maybe the histograms or residuals. A correlation requires a ______________ and ____________.

With n=10, a histogram won’t tell us much. The scatterplot shows a downward trend. With so few points, it’s hard to tell if there’s a curve or just an influential point and an outlier.

We can do a __________ correlation as well, which handles curved relationships better. We come to the same conclusion: a strong negative correlation, definitely significant.

Part of the problem with this data point is that the number of books read per year can’t go below zero. Following the trend, this person would read negative books if he/she could.
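As a cross-check on the SPSS output, both correlations for the ten children can be computed by hand. This is only a minimal sketch in plain Python, not the SPSS procedure. It assumes the second correlation on the slide is a rank-based (Spearman) correlation, computed here as the Pearson correlation of tie-averaged ranks. The function names `pearson_r`, `average_ranks`, and `spearman_rho` are made up for illustration.

```python
import math

books = [0, 7, 2, 1, 5, 4, 3, 3, 0, 1]   # X: books read last year
tv = [3, 1, 2, 2, 0, 1, 3, 2, 7, 4]      # Y: hours of TV watched per day


def pearson_r(x, y):
    """Pearson correlation: shared variation over the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from smallest (1) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Assumed rank correlation: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


r = pearson_r(books, tv)
rho = spearman_rho(books, tv)
# t statistic for testing r against zero, with n - 2 = 8 degrees of freedom
t_stat = r * math.sqrt(len(books) - 2) / math.sqrt(1 - r ** 2)

print(f"Pearson r    = {r:.3f}")
print(f"Rank-based r = {rho:.3f}")
print(f"t (df = 8)   = {t_stat:.3f}")
```

The Pearson value should round to the -.725 reported above. The t statistic can be compared against a one-tailed critical value from a t table with 8 degrees of freedom.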
We have a slight violation of normality. You can only read 0 books or more, but the normal curve continues into the negatives. Usually this part of the curve is so small we ignore it, but here it comes up. Without it, the correlation might have been even stronger.

One more thing: We’re using books per year, but TV per day. Does it matter that these are on different scales?

TVyear is hours of television per year (TV * 365, ignoring leap years).

There is no difference in the correlation. Absolutely none. This is good. It means the correlation reflects the relationship between two variables regardless of the scale of measurement. Correlation (and t-scores, etc.) is unaffected by scale. The size of an object is the same whether you measure it in meters or kilometers, so why should any conclusions about it change?

Another quick one before we finish Chapter 10.

Problem 10.21: Besides studying time, intelligence itself may be related to test performance. Find the partial correlation of studying time and exam grade, holding baseline skill constant. Alpha = 0.10 (higher than usual)

X: Hours studied      4    1    3    5    8    2    7    6
Y: Exam grade         5    2    1    5    9    7    6    8
Z: Baseline skill   100   95   95  108  110  117  110  115

First, let’s look at the simple correlations. We find a positive correlation between grades and study time. Hours and skill look correlated, but with only six degrees of freedom, it could easily be a fluke. (We’d see a correlation this strong 24% of the time even if there were no real relationship.) Exam grades and baseline skill are also strongly correlated.

Is the grades payoff from study time just a side effect of a possible link between skill and time spent? For that we need the partial correlations.

Simple correlation rXY = .683
Partial correlation rXY.Z = .634

Not much change. However, the correlation between study time and exam grade is no longer significant at alpha = 0.10 (two-tailed), partly because we’re down to 5 degrees of freedom.
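The partial correlation above can be rebuilt from the three simple correlations. The sketch below uses the standard first-order partial correlation formula, rXY.Z = (rXY - rXZ * rYZ) / sqrt((1 - rXZ^2)(1 - rYZ^2)). That formula is not shown on the slides, which rely on SPSS, and the function names `pearson_r` and `partial_r` are made up for illustration.

```python
import math

hours = [4, 1, 3, 5, 8, 2, 7, 6]                 # X: hours studied
grade = [5, 2, 1, 5, 9, 7, 6, 8]                 # Y: exam grade
skill = [100, 95, 95, 108, 110, 117, 110, 115]   # Z: baseline skill


def pearson_r(x, y):
    """Simple (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, holding Z constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_xy = pearson_r(hours, grade)   # study time vs. grade
r_xz = pearson_r(hours, skill)   # study time vs. baseline skill
r_yz = pearson_r(grade, skill)   # grade vs. baseline skill
r_xy_z = partial_r(r_xy, r_xz, r_yz)

# The partial correlation loses one more degree of freedom: n - 3 instead of n - 2.
df_simple = len(hours) - 2
df_partial = len(hours) - 3

print(f"rXY   = {r_xy:.3f} (df = {df_simple})")
print(f"rXZ   = {r_xz:.3f}")
print(f"rYZ   = {r_yz:.3f}")
print(f"rXY.Z = {r_xy_z:.3f} (df = {df_partial})")
```

Run as written, this should agree with the SPSS values of .683 for the simple correlation and .634 for the partial correlation.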
rXY = .683 (simple)
rXY.Z = .634 (holding skill constant)

The skill-time correlation wasn’t significant, so it’s not a surprise that it didn’t affect the grades-time correlation much. Study time is positively correlated with both skill and grades, so if there was a difference, we’d expect partial < simple.

Mmmyes, an excellent chapter if I say so myself.

Chapter 11: Regression

With correlation, we can describe how strongly and in what direction there is a linear relationship between two variables. With regression we can describe __________ one variable increases or decreases as another one goes up.

This is done by way of a ____________, which is the line that goes through the middle of the data in a scatter plot.

The formula of the regression line, and any line, is the slope-intercept formula:

Y = a + bX + e

X is the explanatory/independent variable. Y is the response/dependent variable. Which variable goes where is up to the question at hand.

b is the slope, it’s defined by _____________. (IMPORTANT!!!)
“For every 1 unit that X increases, Y increases by b units.”

If the slope b is positive, then Y will increase by a positive amount when X increases. If the slope b is negative, then Y will increase by a negative amount when X increases. (In other words, Y __________ when X increases when b is negative.)

Either the correlation r and the slope b are both positive or they’re both negative. If the correlation is significantly different from zero, so is the slope.

a is the __________. It’s the average value of Y when X is zero. This makes a lot more sense in a practical context.

e stands for ________, or residual. It’s a measure of how far the line is from describing Y perfectly. The errors always have an average of zero (added over all points).

Example 1: Books vs. Television. This is the scatterplot of books read per year vs. TV watched per day. The regression line through it has the formula:

Ŷ = 4.701 - 0.841X

(Here X is hours of TV watched per day and Y is books read per year.)

The slope is -0.841.
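The line above can be reproduced with the usual least-squares formulas, b = Sxy / Sxx and a = mean(Y) - b * mean(X). These formulas are standard but are not given on the slides, and the function name `least_squares_line` is made up for this sketch. It assumes X is hours of TV per day and Y is books read per year, which is the reverse of the X and Y labels used in Problem 10.19.

```python
tv = [3, 1, 2, 2, 0, 1, 3, 2, 7, 4]      # X here: hours of TV watched per day
books = [0, 7, 2, 1, 5, 4, 3, 3, 0, 1]   # Y here: books read last year


def least_squares_line(x, y):
    """Return (a, b) for the line Y-hat = a + b*X that minimizes squared errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope: change in Y per 1-unit increase in X
    a = mean_y - b * mean_x    # intercept: predicted Y when X is zero
    return a, b


a, b = least_squares_line(tv, books)

# Residuals: observed Y minus the value the line predicts.
residuals = [yi - (a + b * xi) for xi, yi in zip(tv, books)]
mean_residual = abs(sum(residuals) / len(residuals))

print(f"intercept a = {a:.3f}")
print(f"slope b     = {b:.3f}")
print(f"|mean residual| = {mean_residual:.6f}")
```

The mean residual comes out as zero up to rounding, which is the "errors average to zero" property mentioned above. Under the stated assumptions, the sketch reproduces the slide's intercept of 4.701 and its slope of -0.841.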
That means for every extra hour/day of TV watching, 0.841 fewer books are read per year, on average.

In Ŷ = 4.701 - 0.841X, the slope is -0.841 and the intercept is 4.701. That means when no TV is watched (the TV variable is zero), an average of 4.701 books are read per year.

There is no number for error shown because this is the average trend. Ŷ is an estimate of the true response Y. The errors are the difference between what we estimate and what we really get:

e = Y - Ŷ

When we’re just looking at the trend, we ignore e.

Example 2: Grades and study time.

Ŷ = 1.893 + 0.774X

The slope is 0.774. That means for every 1 hour of study time, the average exam grade __________________ points.

The slope is 0.774, and the intercept is 1.893. That means the average exam grade for someone _________ _________ _________ was 1.893.

Monday: More on the intercept, regression SPSS, prediction and extrapolation.
Wednesday: Midterm 2 review. (Mostly chapter 7)
Friday: Midterm 2. DUN DUN DUN!