Exercises L6 – Relations between variables

1. In 1984 Geraldine Ferraro was the first female candidate for the vice-presidency of the US. In August 1984 Time Magazine held a poll using two random samples of 500 male and 500 female voters. They were all asked who, in their opinion, would be the best vice president, Bush or Ferraro? The possible answers were: Bush, Ferraro or no preference.

                     Bush   Ferraro   No preference
   Male voters        245       155             100
   Female voters      185       235              80

   a. Find the individual distribution of the preference and compare this distribution to the conditional distributions of the preference for male and female voters. Do you expect the difference to be significant?
   b. Is this a one-sample or a multiple-sample set-up?
   c. Conduct the test on homogeneity to investigate whether your expectation in a. is correct, at a 5% level of significance.

2. Business Week investigated the behaviour of travellers by asking them whether they booked a continental or an intercontinental flight and what kind of ticket they bought. Here are the results:

                          Type of flight
   Type of ticket     Continental   Intercontinental
   Business class              29                 22
   Comfort class               95                121
   Economy class              518                135

   a. Find the conditional distributions of the variable "Type of ticket" given the type of flight and compare them.
   b. Is this a one-sample or a multiple-sample set-up?
   c. Test whether there is a relation between the type of flight and the choice of the type of ticket (α = 0.05).

3. Green pea plants have different characteristics: the colour of the flowers can be either blue or red and the shape of the pollen can be either long or round. A biologist observed many plants and tried to find out whether these characteristics can be considered independent. After observing 427 green pea plants he published the following table:

                  Colour
   Pollen     Blue    Red   Total
   Long        296     27     323
   Round        19     85     104
   Total       315    112     427

   a. Use the chi-square test to show whether or not Pollen and Colour are independent.
   b. The biologist knows that the characteristics "blue flower" and "long pollen" are dominant and expects proportions 3:1 for both marginal distributions (the individual distributions of colour and pollen). In case of independence he expects the cell frequencies to have proportions 9:3:3:1. Do these data confirm these assumptions? The distribution, given these assumptions, is completely specified: that is why you should use a chi-square test with rc - 1 degrees of freedom in this case.

4. In a survey of 15 arbitrarily chosen patients a researcher examined the relation between the blood flow (y) in the brains and the oxygen tension (x) in the blood. The oxygen tension is easy to measure: a reliable positive linear relation would be convenient. The researcher used the sample of 15 patients to sort this relation out.

   y   84.33  87.80  82.20  78.21  78.44  80.01  83.53  79.46  75.22  76.58  77.90  78.80  80.67  86.60  78.20
   x   603.4  582.5  556.2  594.6  558.9  575.2  580.1  451.2  404.0  484.0  452.4  448.4  334.8  320.3  350.3

   a. Plot the data (use SPSS if available).
   b. Compute the correlation coefficient and the estimated regression line, using your calculator.
   c. Compute the residuals (check whether the sum of the residuals is approximately 0) and the regression variance s².
   d. Test (α = 0.05) whether there is a positive linear relation between x and y.
   e. Check b.-d. using the simple linear regression menu of SPSS. Report the value of the test statistic and the p-value.
   f. Find a 90% confidence interval and a 90% prediction interval for the blood flow if the oxygen tension is x = 600. Explain which of the two intervals should be used in the case of a patient who needs treatment.

5. (Exercise 2, see descriptive linear regression): The USA is spending more and more money on Defence. What is the relation between the budget for Defence and the economy?
   Here are the figures of the Defence budget (in dollars per head) and the GNP (gross national product, in billions of dollars) in the USA during 7 years.

   Year              1976   1977   1978   1979   1980   1981   1982
   Defence budget     423    465    499    553    631    739    846
   GNP               2829   2959   3115   3192   3187   3249   3166

   The results of exercise 2 were (x = Defence budget):

   n    x̄        sX       ȳ         sY        Σxiyi      r       r²      a = b1   b = b0
   7    593.71   154.10   3099.57   150.630   12982607   0.724   0.524   0.707    2679.603

   a. Find a 90% confidence interval for the increase of the GNP per dollar per head increase of the Defence budget.
   b. Use the interval of a. to answer the question whether H0 : β1 = 0 can be rejected versus H1 : β1 > 0 and state the correct value of α.
   c. Predict the GNP in a year where the Defence budget is 700 $ per head, using an interval at a confidence level of 95%.
   d. Use SPSS to check the interval in a. and the test in b. and draw a scatter diagram including the fitted line. If possible, also draw the confidence and prediction intervals.

Solutions

Exercise 1
a.
                      Male            Female
   Bush               O = 245 (49%)   O = 185 (37%)
   Ferraro            O = 155 (31%)   O = 235 (47%)
   No preference      O = 100 (20%)   O = 80  (16%)
   Total              500 (100%)      500 (100%)

   The differences are +12%, -16% and +4%: they could be significant.
b. Two samples: n1 = 500 and n2 = 500.
c. Summary: X² = 27.01, p-value = P(χ²₂ ≥ 27.01) < 0.0005 (df = (2-1)(3-1) = 2), or critical value c = 5.991 for this right-sided test.

   Preference         Male                 Female               Total
   Bush               O = 245, E = 215     O = 185, E = 215       430
   Ferraro            O = 155, E = 195     O = 235, E = 195       390
   No preference      O = 100, E = 90      O = 80,  E = 90        180
   Total              500                  500                   1000

Exercise 2
a.
                      Type of flight
   Type of ticket     Continental         Intercontinental    Total
   Business class     O = 29  (4.5%)      O = 22  (7.9%)         51
   Comfort class      O = 95  (14.8%)     O = 121 (43.5%)       216
   Economy class      O = 518 (80.7%)     O = 135 (48.6%)       653
   Total              642 (100.0%)        278 (100.0%)          920

   The differences seem large.
b. A one-sample approach: for every traveller the type of flight and the type of ticket are recorded, so the row and column totals are stochastic.
c. Before applying the chi-square test for independence we first compute the row and column totals and the expected values in case of independence. E.g. for the first cell: observed value O = 29 and expected value E = (row total × column total) / n = 642 × 51 / 920 = 35.6 (see the table below).

                      Type of flight
   Type of ticket     Continental            Intercontinental      Total
   Business class     O = 29,  E = 35.6      O = 22,  E = 15.4        51
   Comfort class      O = 95,  E = 150.7     O = 121, E = 65.3       216
   Economy class      O = 518, E = 455.7     O = 135, E = 197.3      653
   Total              642                    278                     920

   1. Is there a relation between type of flight and type of ticket?
   2. We apply the chi-square test on independence because we have a 3×2 cross table for two nominal variables.
   3. We test H0 : independence versus H1 : a relation between the types of flight and ticket, at α = 0.05.
   4. Test statistic X² = Σ (O - E)² / E ~ χ²₂ (df = (3-1)(2-1) = 2) if H0 is true.
   5. Observed value X² = Σ (O - E)² / E = (29 - 35.6)²/35.6 + ... + (135 - 197.3)²/197.3 = 100.3
   6. Right-sided test: if X² ≥ c, then reject H0; c = 5.99 from the χ²₂ table at α = 0.05.
   7. In this case X² = 100.3 ≥ c => reject H0.
   8. A relation between type of flight and type of ticket is proven at the 5% level.

Exercise 3
a. X² = 218.4, p-value = P(χ²₁ ≥ 218.4) < 0.0005

                  Colour
   Pollen     Blue                   Red                  Total
   Long       O = 296, E = 238.3     O = 27, E = 84.7       323
   Round      O = 19,  E = 76.7      O = 85, E = 27.3       104
   Total      315                    112                    427

b. X² ≈ 219

Exercise 4
a. Plot ->
   [Scatter plot of y (vertical axis, 74.00 to 88.00) against x (horizontal axis, 300.0 to 600.0)]
b. x̄ = 486.42, sx = 100.41, ȳ = 80.53, sy = 3.65, r = 0.199 (r² = 0.0395), b1 = 0.007224, b0 = 77.016
c. Sum of squared residuals: Σ (yi - b0 - b1xi)² ≈ 178.1, so s² = 178.1/(n-2) = 13.700 => s = 3.701
d. 1. Does the oxygen tension (x) influence the blood flow (y) in the brains positively?
   2. Assumptions: Yi = β0 + β1xi + εi, in which ε1,..., ε15 are independent and N(0, σ²)-distributed.
   3. Test H0 : β1 = 0 versus H1 : β1 > 0, α = 0.05
   4. Test statistic T = B1 / (S / √Σ(xi - x̄)²) ~ t(n-2) = t(13) if H0 is true.
   5. Observed value of T: t = 0.007224 / (3.701/375.70) = 0.007224 / 0.009851 = 0.73, where Σ(xi - x̄)² = 14 · sx² = 14 · (100.41)² = 141150.35
   6. Right-sided test: if T ≥ c => reject H0. P(T13 ≥ c) = 0.05 => c = 1.771
   7. t = 0.73 < c => H0 is not rejected.
   8. A positive linear relation between the oxygen tension and the blood flow in the brains is not proven at the 5% level.
e. (Check with SPSS.)
f. The (1-α)100%-CI(β0 + β1x*) has bounds ŷ ± c · s · √(1/n + (x* - x̄)² / Σ(xi - x̄)²), where P(Tn-2 ≥ c) = α/2.
   The (1-α)100%-PI(Y*) has bounds ŷ ± c · s · √(1 + 1/n + (x* - x̄)² / Σ(xi - x̄)²), where P(Tn-2 ≥ c) = α/2.
   In this case n = 15, x* = 600, α = 10%, c = 1.771 (t(13)-distribution, tail probability α/2 = 5%), Σ(xi - x̄)² = 141150.35, s = 3.701 and ŷ = 77.016 + 0.007224 × 600 = 81.4.
   90%-CI(β0 + β1x*) = 81.4 ± 2.6 = (78.7, 84.0)
   90%-PI(Y*) = 81.4 ± 7.1 = (74.3, 88.4)
   We should use the last (very wide) interval because we want to predict one particular blood flow.

Exercise 5
a. 90%-CI(β1): b1 ± c · s / √Σ(xi - x̄)² = (0.10, 1.32), where P(T7-2 ≥ c) = α/2 = 0.05 : c = 2.015, Σ(xi - x̄)² = 6 · sX² = 6 · (154.10)² and the sum of squared residuals = 64837 => s = √(64837/5) ≈ 113.9
b. 0 is not in the 90%-CI(β1) => H0 : β1 = 0 can be rejected versus the one-sided H1 : β1 > 0 at α = 5%.
c. The 95%-PI(Y*) has bounds ŷ ± c · s · √(1 + 1/n + (x* - x̄)² / Σ(xi - x̄)²), where n = 7, P(T7-2 ≥ c) = α/2 = 2.5% => c = 2.571, ŷ = 2679.6 + 0.707 × 700 = 3174.5 and s = 113.9.
   95%-PI(Y*) = 3174.5 ± 324 → (2851, 3498)
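
The chi-square solutions of exercises 1 and 2 can be cross-checked without SPSS. The sketch below is only an illustration in plain Python: the function name chi_square and the variable names are assumptions of this example, not part of the course material. The counts are taken from the solution tables (185 female voters for Bush) and the critical value 5.991 is the one quoted in the solutions.

```python
def chi_square(observed):
    """Return (X2, df, expected) for an r x c table of observed counts.

    Expected counts follow E = row total * column total / n, the formula
    used in the solution of exercise 2c.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected


# Rows: male, female; columns: Bush, Ferraro, no preference (exercise 1).
voters = [[245, 155, 100], [185, 235, 80]]
# Rows: continental, intercontinental; columns: business, comfort, economy (exercise 2).
flights = [[29, 95, 518], [22, 121, 135]]

CRITICAL_DF2 = 5.991  # chi-square critical value for df = 2, alpha = 0.05 (from the solutions)

for name, table in (("Exercise 1", voters), ("Exercise 2", flights)):
    x2, df, _ = chi_square(table)
    decision = "reject H0" if x2 >= CRITICAL_DF2 else "do not reject H0"
    print(f"{name}: X2 = {x2:.2f}, df = {df}, {decision}")
```

Because the code keeps the expected counts unrounded, the statistic can differ slightly in the last digits from a hand calculation with rounded expected values.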
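
For exercise 4 (parts b. to d.) the calculator work can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming the first data row of the exercise is y (blood flow) and the second is x (oxygen tension); the function name fit_line and the dictionary keys are hypothetical names chosen for this example, and the critical value 1.771 is copied from the solutions rather than computed.

```python
import math

# Data of exercise 4: blood flow (y) and oxygen tension (x) of 15 patients.
y = [84.33, 87.80, 82.20, 78.21, 78.44, 80.01, 83.53, 79.46,
     75.22, 76.58, 77.90, 78.80, 80.67, 86.60, 78.20]
x = [603.4, 582.5, 556.2, 594.6, 558.9, 575.2, 580.1, 451.2,
     404.0, 484.0, 452.4, 448.4, 334.8, 320.3, 350.3]

T_CRIT_13 = 1.771  # from the solutions: P(T13 >= c) = 0.05


def fit_line(x, y):
    """Least-squares line y = b0 + b1*x with r, s and the residual sum."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Regression standard deviation: sqrt(sum of squared residuals / (n - 2)).
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return {
        "b0": b0,
        "b1": b1,
        "r": sxy / math.sqrt(sxx * syy),
        "s": s,
        "sxx": sxx,
        "residual_sum": sum(residuals),
    }


fit = fit_line(x, y)
# Observed value of the test statistic for H0: beta1 = 0 versus H1: beta1 > 0.
t_obs = fit["b1"] / (fit["s"] / math.sqrt(fit["sxx"]))

print(f"b1 = {fit['b1']:.6f}, b0 = {fit['b0']:.3f}, r = {fit['r']:.3f}")
print(f"sum of residuals = {fit['residual_sum']:.2e}, s = {fit['s']:.3f}")
print(f"t = {t_obs:.2f}, reject H0: {t_obs >= T_CRIT_13}")
```

The residual sum is printed to support the check asked for in part c.; small deviations from the hand-computed values are to be expected because the code does not round intermediate results.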