Solution to 5.12

[Figure: scatterplot of y vs x]

a. Comments on the features:
   i. The trend seems to be linear, with one outlying observation.
   ii. There are only 9 data points and they look fairly scattered, so the relationship may not be that strong.

b. The value of y appears to decrease as the value of x increases; that is, the linear relationship seems to be negative.

c. The following table gives the computation of r.

   mean(x) = 16, mean(y) = 53, s.d.(x) = 6.82, s.d.(y) = 10.02
   (How do I compute these numbers? What are the formulae? See the solutions to the problems from chapter 4.)

   To compute r, we first compute the z-scores for all x-values and all y-values in the following table. (How do I compute a z-score? What is the formula? See the solutions to the problems from chapter 4.)

      x       Zx        y       Zy        ZxZy
     23    1.02639     43    -0.99800   -1.024340
     14   -0.29326     59     0.59880   -0.175600
     14   -0.29326     48    -0.49900    0.146337
      0   -2.34604     77     2.39521   -5.619260
     17    0.14663     50    -0.29940   -0.043900
     20    0.58651     52    -0.09980   -0.058530
     20    0.58651     46    -0.69860   -0.409740
     15   -0.14663     51    -0.19960    0.029267
     21    0.73314     51    -0.19960   -0.146330

   $$r = \frac{\sum Z_x Z_y}{n - 1}$$

   Σ ZxZy = sum of the last column in the table above = -7.3021, so
   r = -7.3021/8 = -0.91276

   i. Since the value of r is negative, there is a negative linear relationship between x and y.
   ii. Since the value of r is close to -1, we conclude that there is a strong linear relationship between x and y.
   iii. The value of r supports our speculation in part b.

d. The strong negative linear relationship between anxiety and exam scores indicates that higher anxiety goes with lower exam scores, so anxiety may help explain poor scores on the exam.

5.18

x = grade in which a student is studying (independent variable)
y = percentage of students who are likely to purchase (dependent variable)

[Figure: scatterplot of Percenta vs Grade]

a. Comments on the features of the scatterplot:
   i. Though there are only four data points, the scatter diagram shows a linear trend among the points.
   ii. The relationship between x and y seems to be positive.

b. To find the equation of the line is to find the values of a and b in the relation y = a + bx.

      x      x^2      y        y^2        xy
      6      36     32.7    1069.29     196.20
      8      64     46.1    2125.21     368.80
     10     100     75.0    5625.00     750.00
     12     144     83.6    6988.96    1003.20

   Σx = 36, Σx^2 = 344, Σy = 237.4, Σy^2 = 15808.5, Σxy = 2318.2

   $$b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \qquad \text{and} \qquad a = \frac{\sum y - b \sum x}{n}$$

   Plugging the sums into the formulae with n = 4, we get
   b = (2318.2 - 36*237.4/4) / (344 - 36^2/4) = 181.6/20 = 9.08
   a = (237.4 - 9.08*36)/4 = -22.37, or approximately -22.4

   The fitted line is y = -22.4 + 9.08*x
   slope = 9.08
   y-intercept = -22.4

   Since the slope is positive, the relationship between x and y is positive; that is, the value of y increases as x increases.
   Interpretation of the slope: if the value of x increases by 1 unit, then on average the value of y increases by approximately 9 units.

   How to predict? If students are in 7th grade, what is the corresponding percentage? That is, predict y for x = 7.
   We plug x = 7 into the fitted line. The predicted value is y = -22.4 + 9.08*7 = 41.16.
   That is, approximately 41% of 7th graders are likely to purchase the tickets.

5.19

a. dependent variable = y = number of fruit/vegetable servings per day
   predictor variable = x = number of TV viewing hours per day
b. The line will have a negative slope, since the number of servings decreases as the number of TV viewing hours increases.

5.31

[Figure: scatterplot of y vs x]

a. Comments on the scatterplot:
   i. The points on the scatterplot show a linear trend.
   ii. The linear trend seems to have a positive slope; that is, the value of y is in general increasing with the value of x.
   iii. The points are fairly scattered, indicating that the relationship may not be that strong.

b. Given that
   Σx = 1368.1, Σx^2 = 117123.85, Σy = 80.9, Σy^2 = 412.81, Σxy = 6933.48
   We plug these sums into the formulae to determine the values of b and a.
   The formulae are

   $$b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \qquad \text{and} \qquad a = \frac{\sum y - b \sum x}{n}$$

   With n = 16:
   b = (6933.48 - 1368.1*80.9/16) / (117123.85 - (1368.1)^2/16) = 0.1123
   a = 80.9/16 - 0.1123*1368.1/16 = -4.54

c. (This is the interpretation of the slope.) When the chest circumference (x) increases by 1 cm, then on average the corresponding vital capacity increases by 0.1123 units. When the chest circumference (x) increases by 10 cm, then on average the corresponding vital capacity increases by 10*0.1123 = 1.123 units.

d. (This is prediction.) When x = 85, the corresponding y should be
   y = -4.54 + 0.1123*85 = 5.0055
   That is, when the chest circumference is 85 cm, the average vital capacity is 5.0055 units.

e. No, the vital capacity cannot be determined completely by the chest circumference, because factors other than chest circumference affect the vital capacity. For example, note that in the dataset the chest circumference 88.1 cm corresponds to two different values of vital capacity, namely 5.7 and 5.2. This indicates that other factors also play a role in determining the vital capacity.

5.37

[Figure: scatterplot of y vs x]

We are given that the line fitted to the data has equation y = 64.50 - 0.45x.

[Figure: fitted line plot, y = 64.50 - 0.4500 x; S = 7.08520, R-Sq = 69.3%, R-Sq(adj) = 63.2%]

b. The red dots show the data points given to us (our sample data). The blue line represents the fitted line y = 64.50 - 0.45x.
   R-sq = coefficient of determination = square of r, so the magnitude of r is sqrt(R-sq) = sqrt(0.693) = 0.832. Because the slope of the fitted line is negative, r = -0.832.
   The y-value corresponding to x = 40 looks like an outlier. It is far away from the fitted line, which will result in a large residual value.

c. residual = observed value of y - fitted value of y = y - ŷ
   This is the error in our prediction.
   The following table shows the computation of the residuals, using ŷ = 64.50 - 0.45x.

      x      y      ŷ      residual = y - ŷ (amount of error in prediction)
     40     58    46.5     11.5
     50     34    42.0     -8.0
     60     32    37.5     -5.5
     70     30    33.0     -3.0
     80     28    28.5     -0.5
     90     27    24.0      3.0
    100     22    19.5      2.5

[Figure: scatterplot of residuals vs x]

   i. This plot shows that the residual corresponding to x = 40 is very large compared to the others.
   ii. This plot also shows that, in general, the residuals (amounts of error in prediction) decrease in size, that is, approach zero, as the value of x increases.

If we delete the observation at x = 40 from our computation and recalculate the line, we get the following graph and line.

[Figure: fitted line plot, Ynew = 45.33 - 0.2200 Xnew; S = 1.01653, R-Sq = 95.3%, R-Sq(adj) = 94.2%]

1. Note that most of the data values are now close to the fitted line.
2. Note that the coefficient of determination has increased to 95.3%, which increases the reliability of our prediction.
3. The magnitude of r is sqrt(0.953) = 0.976; since the slope is negative, r = -0.976.
4. In a nutshell, removing an extreme point from the given data set gives us an improved and more reliable prediction line.
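
The hand calculations in these solutions (r from z-scores in 5.12, the slope and intercept from sums in 5.18, and the residuals and refit in 5.37) can be checked with a short script. The following is a minimal Python sketch and not part of the original solutions: the function names correlation, fit_line, and residuals are made up for illustration, and only the data values are taken from the problems above.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Pearson r as the sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def fit_line(xs, ys):
    """Least-squares intercept a and slope b of y = a + b*x, from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = (sum_y - b * sum_x) / n
    return a, b


def residuals(xs, ys, a, b):
    """Observed y minus fitted y for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# 5.12: correlation between x and y
x_512 = [23, 14, 14, 0, 17, 20, 20, 15, 21]
y_512 = [43, 59, 48, 77, 50, 52, 46, 51, 51]
r_512 = correlation(x_512, y_512)
print(f"5.12  r = {r_512:.3f}")

# 5.18: fitted line and prediction for grade 7
x_518 = [6, 8, 10, 12]
y_518 = [32.7, 46.1, 75.0, 83.6]
a_518, b_518 = fit_line(x_518, y_518)
print(f"5.18  a = {a_518:.2f}, b = {b_518:.2f}, "
      f"prediction at x = 7: {a_518 + b_518 * 7:.2f}")

# 5.37: fit with all points, residuals, then refit without x = 40
x_537 = [40, 50, 60, 70, 80, 90, 100]
y_537 = [58, 34, 32, 30, 28, 27, 22]
a_537, b_537 = fit_line(x_537, y_537)
res_537 = residuals(x_537, y_537, a_537, b_537)
r_537 = correlation(x_537, y_537)
print(f"5.37  a = {a_537:.2f}, b = {b_537:.2f}, R-sq = {r_537 ** 2:.3f}")
print("5.37  residuals:", [round(e, 2) for e in res_537])

a_new, b_new = fit_line(x_537[1:], y_537[1:])
r_new = correlation(x_537[1:], y_537[1:])
print(f"5.37 (x = 40 removed)  a = {a_new:.2f}, b = {b_new:.2f}, "
      f"R-sq = {r_new ** 2:.3f}")
```

Small differences from the hand-worked values are expected because the script does not round intermediate results. For example, it uses the unrounded intercept in 5.18 and unrounded z-scores in 5.12.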