Chapter 8 Review, Part 2

Chapter 8 – Regression 2
Basic review: estimating the standard error of the estimate, plus shortcut problems and solutions.
You can use the regression equation when:
1. the relationship between X and Y is linear,
2. r falls outside the CI.95 around 0.000 and is therefore a statistically significant correlation, and
3. X is within the range of X scores observed in your sample.
Simple problems using the regression equation
tY' = r(tX)
tY' = .150 * 0.40 = 0.06
tY' = .40 * -1.70 = -0.68
tY' = .40 * 1.70 = 0.68
Predictions from Raw Data
1. Calculate the t score for X: tX = (X - X̄) / sX
2. Solve the regression equation: tY' = r(tX)
3. Transform the estimated t score for Y into a raw score: Y' = Ȳ + (tY')(sY)
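Below is a minimal sketch of these three steps in Python; the function name predict_y and its parameter names are mine, not the slides'.

    def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
        """Estimate a raw Y' from a raw X via the t-score regression equation."""
        t_x = (x - mean_x) / sd_x          # step 1: raw X to a t score
        t_y_prime = r * t_x                # step 2: regression equation tY' = r(tX)
        return mean_y + t_y_prime * sd_y   # step 3: tY' back to a raw Y'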
Predicting from and to raw scores
Problem: Estimate the midterm point total given a study time of 400 minutes.
It is given that the estimated mean of study time is 560 minutes and its estimated standard deviation is 216.02. (Range = 260-860)
It is given that the estimated mean of midterm points is 76 and their estimated standard deviation is 7.98.
The estimated correlation coefficient is .851.
Predicting from and to raw scores
1. Translate raw X to a tX score:
tX = (X - X̄) / sX = (400 - 560) / 216.02 = -0.74
2. Use the regression equation to find tY':
tY' = r(tX) = .851 * (-0.74) = -0.63
3. Translate tY' to raw Y':
Y' = Ȳ + (tY')(sY) = 76.00 + (-0.63 * 7.98) = 70.97
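Plugging the problem's numbers into the hypothetical predict_y sketch above reproduces this answer:

    predict_y(400, 560, 216.02, 76.00, 7.98, 0.851)  # ≈ 70.97 midterm points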
A Caution
Never assume that a correlation will stay linear outside of the range you originally observed.
Therefore, never use the regression equation to make predictions from X values outside of the range you found in your sample.
Example: Measuring heights of children. Height rises roughly linearly with age during childhood, but extrapolating that line past the observed ages would predict ever-growing adults.
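One way to build this caution into the earlier sketch; the wrapper is mine, and the range bounds are the ones given in the worked problem above.

    def predict_y_checked(x, x_min, x_max, **params):
        # Refuse to extrapolate beyond the observed range of X
        if not (x_min <= x <= x_max):
            raise ValueError(f"X = {x} is outside the observed range [{x_min}, {x_max}]")
        return predict_y(x, **params)

    # OK: 400 minutes is inside the observed range of 260-860
    predict_y_checked(400, 260, 860, mean_x=560, sd_x=216.02,
                      mean_y=76.00, sd_y=7.98, r=0.851)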
Reviewing the r table and reporting the results of calculating r from a random sample
How the r table is laid out: the important columns
Column 1 of the r table shows degrees of freedom for correlation and regression (dfREG): dfREG = nP - 2.
Column 2 shows the CI.95 for varying degrees of freedom.
Column 3 shows the absolute value of the r that falls just outside the CI.95. Any r this far or further from 0.000 falsifies the hypothesis that rho = 0.000 and can be used in the regression equation to make predictions of Y scores for people who were not in the original sample but who were part of the population from which the sample was drawn.
df        nonsignificant       .05      .01
1         -.996 to .996        .997     .9999
2         -.949 to .949        .950     .990
3         -.877 to .877        .878     .959
4         -.810 to .810        .811     .917
5         -.753 to .753        .754     .874
6         -.706 to .706        .707     .834
7         -.665 to .665        .666     .798
8         -.631 to .631        .632     .765
9         -.601 to .601        .602     .735
10        -.575 to .575        .576     .708
11        -.552 to .552        .553     .684
12        -.531 to .531        .532     .661
...
100       -.194 to .194        .195     .254
200       -.137 to .137        .138     .181
300       -.112 to .112        .113     .148
500       -.087 to .087        .088     .115
1000      -.061 to .061        .062     .081
2000      -.043 to .043        .044     .058
10000     -.019 to .019        .020     .026

How to read the table:
Find your degrees of freedom (nP - 2) in the df column.
If r falls within the 95% CI around 0.000 (the "nonsignificant" column), the result is not significant: you cannot reject the null hypothesis and must assume that rho = 0.000.
Does the absolute value of your r equal or exceed the value in the .05 column? If so, r is significant with alpha = .05: you can use it in the regression equation to estimate Y scores, and you can consider it an unbiased, least squares estimate of rho.
Example: Anchovy pizza and horror films, rho = 0.000 (scale 0-9)
H1: People who enjoy food with strong flavors also enjoy other strong sensations.
H0: There is no relationship between enjoying food with strong flavors and enjoying other strong sensations.

anchovies   horror films
7           7
7           9
3           8
3           6
0           9
8           6
4           5
1           2
1           1
1           6

Can we reject the null hypothesis?
Can we reject the null hypothesis?
[Scatterplot: Pizza ratings (y-axis, 0-8) against Horror films ratings (x-axis, 0-8)]
Can we reject the null hypothesis?
We do the math and we find that:
r = .352
df = 8
r table
With df = 8, the row of the r table (shown in full above) reads: CI.95 = -.631 to .631; .05 value = .632; .01 value = .765.
This finding falls within the CI.95 around 0.000.
We call such findings "nonsignificant."
Nonsignificant is abbreviated n.s.
We would report this finding as follows: r(8) = 0.352, n.s.
Given that it fell inside the CI.95, we must assume that rho actually equals zero and that our sample r is .352 instead of 0.000 solely because of sampling fluctuation.
We go back to predicting that everyone will score at the mean of Y.
How to report a significant r
For example, let's say that you had a sample (nP = 30) and r = -.400.
Looking under nP - 2 = 28 dfREG, we find the interval consistent with the null is between -.360 and +.360.
So we are outside the CI.95 for rho = 0.000.
We would write that result as r(28) = -.400, p<.05.
That tells you the dfREG, the value of r, and that you can expect an r that far from 0.000 five or fewer times in 100 when rho = 0.000.
Then there is Column 4
Column 4 shows the values that lie outside a CI.99.
(The CI.99 itself isn't shown like the CI.95 in Column 2 because it isn't important enough.)
However, Column 4 gives you bragging rights.
If your r is as far or further from 0.000 as the number in Column 4, you can say there is one or fewer chance in 100 of an r being this far from zero (p<.01).
For example, let's say that you had a sample (nP = 30) and r = -.525.
The critical value at .01 is .463. You are further from 0.00 than that, so you can brag.
You write that result as r(28) = -.525, p<.01.
To summarize
If r falls inside the CI.95 around 0.000, it is nonsignificant (n.s.) and you can't use the regression equation (e.g., r(28) = .300, n.s.).
If r falls outside the CI.95, but not as far from 0.000 as the number in Column 4, you have a significant finding and can use the regression equation (e.g., r(28) = -.400, p<.05).
If r is as far or further from zero as the number in Column 4, you can use the regression equation and brag while doing it (e.g., r(28) = .525, p<.01).
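As a minimal sketch, here is that decision rule in Python. The helper is mine; the critical values are copied from the r table above, except the df = 28 pair, where .361 is inferred from the quoted interval of -.360 to +.360 and .463 is the stated Column 4 value.

    # |r| critical values (.05, .01) for selected dfREG, from the r table above
    CRITICAL = {5: (0.754, 0.874), 8: (0.632, 0.765), 28: (0.361, 0.463)}

    def report_r(r, df_reg):
        """Report r the way these slides do: n.s., p<.05, or p<.01."""
        c05, c01 = CRITICAL[df_reg]
        if abs(r) >= c01:
            return f"r({df_reg}) = {r:.3f}, p<.01"   # bragging rights
        if abs(r) >= c05:
            return f"r({df_reg}) = {r:.3f}, p<.05"   # significant; regression OK
        return f"r({df_reg}) = {r:.3f}, n.s."        # assume rho = 0.000

    print(report_r(-0.400, 28))  # r(28) = -0.400, p<.05
    print(report_r(0.352, 8))    # r(8) = 0.352, n.s.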
Can you reject H0?
r = .386
nP = 19
dfREG = 17
df        nonsignificant       .05      .01
10        -.575 to .575        .576     .708
11        -.552 to .552        .553     .684
12        -.531 to .531        .532     .661
13        -.513 to .513        .514     .641
14        -.496 to .496        .497     .623
15        -.481 to .481        .482     .606
16        -.467 to .467        .468     .590
17        -.455 to .455        .456     .575
18        -.443 to .443        .444     .561
19        -.432 to .432        .433     .549
...
40        -.303 to .303        .304     .393
50        -.272 to .272        .273     .354
60        -.249 to .249        .250     .325

(With dfREG = 17, r = .386 falls inside the interval -.455 to .455, so you cannot reject H0: r(17) = .386, n.s.)
Can you reject H0?
r = -.386
nP = 47
dfREG = 45
(The same rows of the r table apply. dfREG = 45 falls between the df = 40 and df = 50 rows; even by the more conservative df = 40 row, |r| = .386 exceeds .304 but falls short of the .01 value of .393, so you can reject H0: r(45) = -.386, p<.05.)
How much better than the mean can we guess?
Improved prediction
If we can use the regression equation rather
than the mean to make individualized estimates
of Y scores, how much better are our estimates?
We are making predictions about scores on the
Y variable from our knowledge of the
statistically significant correlation between X & Y
and the fact that we know someone’s X score.
The average unsquared error when we predict
that everyone will score at the mean of Y equals
sY, the ordinary standard deviation of Y.
How much better than that can we do?
Estimating the standard error of
the estimate the (very) long way.
Calculate correlation (which includes calculating
s for Y).
If the correlation is significant, you can use the
regression equation to make individualized
predictions of scores on the Y variable.
The average unsquared error of prediction when
you do that is called the estimated standard
error of the estimate.
Example for Prediction Error
A study was performed to investigate
whether the quality of an image affects
reading time.
The experimental hypothesis was that
reduced quality would slow down reading
time.
Quality was measured on a scale of 1 to
10. Reading time was in seconds.
Quality vs Reading Time data: compute the correlation

Quality        Reading time
(scale 1-10)   (seconds)
4.30           8.1
4.55           8.5
5.55           7.8
5.65           7.3
6.30           7.5
6.45           7.3
6.45           6.0

Is there a relationship? Check for linearity. Compute r.
Calculate t scores for X

X      X - X̄    (X - X̄)²   tX = (X - X̄)/sX
4.30   -1.31     1.71       -1.48
4.55   -1.06     1.12       -1.19
5.55   -0.06     0.00       -0.07
5.65    0.04     0.00        0.05
6.30    0.69     0.48        0.78
6.45    0.84     0.71        0.95
6.45    0.84     0.71        0.95

ΣX = 39.25, n = 7, X̄ = 5.61
SSW = 4.73
MSW = 4.73/(7-1) = 0.79
sX = 0.89
Calculate t scores for Y

Y     Y - Ȳ    (Y - Ȳ)²   tY = (Y - Ȳ)/sY
8.1    0.60     0.36        0.76
8.5    1.00     1.00        1.26
7.8    0.30     0.09        0.38
7.3   -0.20     0.04       -0.25
7.5    0.00     0.00        0.00
7.3   -0.20     0.04       -0.25
6.0   -1.50     2.25       -1.89

ΣY = 52.5, n = 7, Ȳ = 7.50
SSW = 3.78
MSW = 3.78/(7-1) = 0.63
sY = 0.79
Plot t scores

tX       tY
-1.48     0.76
-1.19     1.28
-0.07     0.39
 0.05    -0.25
 0.78     0.00
 0.95    -0.25
 0.95    -1.89
t score plot with best fitting line: linear? YES!
[Scatterplot: Image quality (t score) on the x-axis vs. Reading Time (t score) on the y-axis, both running -2.00 to 2.00, with the best-fitting line]
Calculate r

tX      tY      tX - tY   (tX - tY)²
-1.48    0.76   -2.24      5.02
-1.19    1.28   -2.47      6.10
-0.07    0.39   -0.46      0.21
 0.05   -0.25    0.30      0.09
 0.78    0.00    0.78      0.61
 0.95   -0.25    1.20      1.44
 0.95   -1.88    2.83      8.01

Σ(tX - tY)² = 21.48
Σ(tX - tY)² / (nP - 1) = 21.48/6 = 3.580
r = 1 - (1/2 * 3.580) = 1 - 1.79 = -0.79
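A minimal sketch of the whole computation, using the t-score difference formula from these slides (helper names are mine):

    def mean_and_s(xs):
        """Estimated mean and standard deviation (n - 1 in the denominator)."""
        n = len(xs)
        m = sum(xs) / n
        ss = sum((x - m) ** 2 for x in xs)
        return m, (ss / (n - 1)) ** 0.5

    def pearson_r(xs, ys):
        """r = 1 - (1/2) * mean squared difference of the paired t scores."""
        mx, sx = mean_and_s(xs)
        my, sy = mean_and_s(ys)
        d2 = sum(((x - mx) / sx - (y - my) / sy) ** 2 for x, y in zip(xs, ys))
        return 1 - 0.5 * d2 / (len(xs) - 1)

    quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
    seconds = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]
    print(pearson_r(quality, seconds))  # ≈ -0.78; the slides' -0.79 reflects rounded t scores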
Check whether r is significant
r = -0.790
df = nP - 2 = 5
α = .05
Look in the r table: with 5 dfREG, the CI.95 goes from -.753 to +.753.
r(5) = -.790, p<.05
r is significant!
We can calculate the Y' for every raw X

X      Y'
4.30   8.42
4.55   8.23
5.55   7.54
5.65   7.47
6.30   7.01
6.45   6.91
6.45   6.91
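Under the same assumptions, the hypothetical predict_y sketch reproduces this column from the sample statistics above:

    [round(predict_y(x, 5.61, 0.89, 7.50, 0.79, -0.79), 2) for x in quality]
    # matches the Y' column above to within rounding of the last digit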
Can we show mathematically that regression estimates are better than mean estimates?

Y     Ȳ     Y'
8.1   7.5   8.42
8.5   7.5   8.23
7.8   7.5   7.54
7.3   7.5   7.47
7.5   7.5   7.01
7.3   7.5   6.91
6.0   7.5   6.91

We expect, of course, that there will be less error if we use regression.
To calculate the standard deviation, we take deviations of Y from the mean of Y, square them, add them up, divide by degrees of freedom, and then take the square root.
To calculate the standard error of the estimate, sEST, we take the deviations of each raw Y score from its regression equation estimate, square them, add them up, divide by degrees of freedom, and take the square root.
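Here is the long way as a sketch, reusing the hypothetical mean_and_s and predict_y helpers defined earlier:

    def s_est_long_way(xs, ys, r):
        """Standard error of the estimate: deviations of Y from Y', df = n - 2."""
        mx, sx = mean_and_s(xs)
        my, sy = mean_and_s(ys)
        ss_res = sum((y - predict_y(x, mx, sx, my, sy, r)) ** 2
                     for x, y in zip(xs, ys))
        return (ss_res / (len(ys) - 2)) ** 0.5

    s_est_long_way(quality, seconds, -0.79)  # ≈ 0.54, the slides' 0.546 to rounding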
Estimated standard error of the estimate

Y     Y'     Y - Y'   (Y - Y')²
8.1   8.42   -0.32     0.10
8.5   8.23    0.27     0.07
7.8   7.54    0.26     0.07
7.3   7.47   -0.17     0.03
7.5   7.01    0.49     0.24
7.3   6.91    0.39     0.15
6.0   6.91   -0.91     0.83

SSRES = 1.49
MSRES = 1.49/(7-2) = 0.30
sEST = 0.546
How much better?
sY = 0.80
sEST = 0.546
(.80 - .546) / .80 ≈ .32, i.e., 32%
32% less error when we use the regression equation instead of the mean to predict.
Mathematical magic
There is usually an alternative formula for calculating a statistic that is easier to perform.
We went through a lot of extra steps to calculate sEST = 0.546.
It is not necessary to calculate all of the Y' values.
Another way to phrase it: how much error did we get rid of?
Treat it as a weight loss problem.
If Jack is 30 pounds overweight and he loses 40% of it, how much is he still overweight?
He lost .400 x 30 pounds = 12 pounds.
He has 30 - 12 = 18 pounds left to lose.
SSY = error to start; r² = proportion of error lost
SSY is the total amount of error we start with when predicting scores on Y. It is the amount of error when everyone is predicted to score at the mean.
The proportion of error you get rid of by using the regression equation as your predictor equals Pearson's correlation coefficient squared (r²).
To get the total error left, find how much you got rid of, then subtract from what you started with
Amount you got rid of: SSY * r²
Amount left: SSRES = SSY - (SSY * r²)
Average amount of squared error left: MSRES = SSRES/dfREG = SSRES/(nP - 2)
sEST = square root of MSRES
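The shortcut as a sketch, again reusing the hypothetical mean_and_s helper; note that no Y' values are needed:

    def s_est_shortcut(ys, r):
        """sEST without any Y' values: SSRES = SSY * (1 - r**2), df = n - 2."""
        my, _ = mean_and_s(ys)
        ss_y = sum((y - my) ** 2 for y in ys)
        return (ss_y * (1 - r ** 2) / (len(ys) - 2)) ** 0.5

    s_est_shortcut(seconds, -0.79)  # ≈ 0.53, near the slides' 0.535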
Computing sEST the easier way!
We already knew that SSY = 3.80 and r = -0.790.
SSRES = SSY - (SSY * r²) = 3.80 - (3.80 * (-0.79)²) = 1.43
MSRES = 1.43/(7-2) = 0.286
sEST = 0.535
How much better?
sY = 0.80
sEST = 0.535
(.80 - .535) / .80 ≈ .33, i.e., 33%
33% less error when we use the regression equation instead of the mean to predict.
The difference between this 33% and the 32% we calculated the long way is due to rounding error.
Stating the obvious:
The estimated standard deviation (s) was the estimated average unsquared distance of scores in the population from mu.
The estimated standard error of the estimate (sEST) is the estimated average unsquared distance of scores in the population from the regression equation's predicted Y scores.
Both reflect the error of prediction. Using the regression equation individualizes prediction and, if r is significant, leads to less error.
Do one yourself.
Assume the original sum of squares for error is 420.00, nP = 22, and the sum of the squared differences between the tX and tY scores is 12.60.
What is r?
Is r statistically significant? Write the results as you would in a report.
What is the estimated average unsquared distance of Y scores from the regression line?
What percent improvement is obtained when s is compared to sEST?
Answers: What is r? Is it significant?
Compute r:
Σ(tX - tY)² = 12.60
Σ(tX - tY)² / (nP - 1) = 12.60/21 = .600
r = 1.000 - 1/2(.600) = .700
Is r significant?
r(20) = .700, p<.01
What is the estimated average unsquared distance of scores in the population from the regression line?
That is the same as asking, "What is the estimated standard error of the estimate?"
SSRES = SSY - (SSY * r²) = 420.00 - [420.00 * (0.70)²] = 214.20
MSRES = 214.20/20 = 10.71
sEST = 3.27
What percent improvement is obtained when s is compared to sEST?
MSW = SSW/df = 420.00/21 = 20.00
s = √20.00 = 4.47
Last and (perhaps) least:
Proportion improvement = (s - sEST)/s = (4.47 - 3.27)/4.47 = .268
Percent improvement = proportion improvement * 100
In this case there was about a 26.8% improvement in unsquared error when using the regression equation rather than the mean as the basis for predicting Y scores.
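A quick sketch that checks the whole exercise numerically (variable names are mine):

    # Given: SSY = 420.00, nP = 22, sum of squared t-score differences = 12.60
    n_p, ss_y, sum_t_diff_sq = 22, 420.00, 12.60

    r = 1 - 0.5 * sum_t_diff_sq / (n_p - 1)           # .700
    s = (ss_y / (n_p - 1)) ** 0.5                     # 4.47
    s_est = (ss_y * (1 - r ** 2) / (n_p - 2)) ** 0.5  # 3.27
    print((s - s_est) / s)                            # ≈ .268, about a 26.8% improvement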
End chapter 8 slides here
Slides past here were not
covered in lecture and will
not be on the exam.
Error Types: Type 1 Error
Type 1 error occurs when you accidentally
get a random sample with an r outside
the range predicted by the null hypothesis
even though rho=0.000. This forces you
to reject the null hypothesis when there
really is no relationship between X and Y
in the population as a whole.
Scientists are conservative and set up
conditions to avoid Type 1 errors.
Error Types: Type 2 Error
A Type 2 error can only occur when there really is a correlation between X and Y in the population, but you accidentally get a sample r that falls within the range predicted by the null hypothesis. You must then fail to reject the null and assume rho = 0.000.
This is incorrect and results in a Type 2 error.
Alpha levels
Any result can be found by chance. However, some results are so strong that they are very unlikely.
Unlikely is defined as occurring by chance 5 (or fewer) times in 100.
The risk of getting a weird sample that causes a Type 1 error is called alpha.
α = .05