Exercises L6: Cross tables and linear regression (including answers)

Exercises L6 – Relations between variables
1. In 1984 Geraldine Ferraro was the first female candidate for the vice-presidency of the US. In August 1984 Time Magazine
held a poll using two random samples of 500 male and 500 female voters. They were all asked who, in their opinion,
would be the best vice president: Bush or Ferraro? The possible answers were: Bush, Ferraro or no preference.

                 Bush   Ferraro   No preference
Male voters       245     155         100
Female voters     185     235          80

a. Find the individual (marginal) distribution of the preference and compare it to the conditional distributions of the
preference for male and female voters. Do you expect the difference to be significant?
b. Is this a one-sample or a multiple-sample set-up?
c. Conduct the test on homogeneity to investigate whether your expectation in a. is correct at the 5% level of significance.
2. Business Week investigated the behaviour of travellers by asking them whether they booked a continental or an
intercontinental flight and what kind of ticket they bought. Here are the results:

                        Type of flight
Type of ticket     Continental   Intercontinental
Business class          29              22
Comfort class           95             121
Economy class          518             135

a. Find the conditional distributions for the variable “Type of ticket” given the type of flight and compare them.
b. Is this a one-sample or a multiple-sample set-up?
c. Test whether there is a relation between the type of flight and the choice of the type of ticket (α = 0.05).
3. Green pea plants have different characteristics: the colour of the flowers can be either blue or red and the shape of the pollen
can be either long or round. A biologist observed many plants and tried to find out whether these characteristics can be
considered independent. After observing 427 green pea plants he published the following table:

                 Colour
Pollen      Blue    Red    Total
Long         296     27     323
Round         19     85     104
Total        315    112     427

a. Use the chi-square test to show whether or not Pollen and Colour are independent.
b. The biologist knows that the characteristics “blue flower” and “long pollen” are dominant and expects proportions 3:1 for
both marginal distributions (individual distributions of colour and pollen). In case of independence he expects the cell
frequencies to have proportions 9:3:3:1. Do these data confirm these assumptions? Given these assumptions the distribution is
completely specified: that is why you should use a chi-square test with rc − 1 degrees of freedom in this case.
4. A researcher examined the relation between the blood flow in the brain (y) and the oxygen tension in the blood (x).
The oxygen tension is easy to measure, so a reliable positive linear relation would be convenient.
The researcher used a sample of 15 arbitrarily chosen patients to investigate this relation.

y   84.33  87.80  82.20  78.21  78.44  80.01  83.53  79.46  75.22  76.58  77.90  78.80  80.67  86.60  78.20
x   603.4  582.5  556.2  594.6  558.9  575.2  580.1  451.2  404.0  484.0  452.4  448.4  334.8  320.3  350.3

a. Plot the data (use SPSS if available).
b. Compute the correlation coefficient and the estimated regression line, using your calculator.
c. Compute the residuals (check whether the sum of the residuals is approximately 0) and the regression variance s².
d. Test (α = 0.05) whether there is a positive linear relation between x and y.
e. Check b.-d. using the simple linear regression menu of SPSS. Report the value of the test statistic and the p-value.
f. Find a 90% confidence interval and a 90% prediction interval for the blood flow if the oxygen tension is x = 600.
Explain which of the two intervals should be used in case of a patient who needs treatment.
5. (Exercise 2, see descriptive linear regression): The USA is spending more and more money on Defence. What is the
relation between the budget for Defence and the economy? Here are the figures of the Defence budget (in dollars per head)
and the GNP (gross national product, in billions of dollars) in the USA during 7 years.

Year              1976   1977   1978   1979   1980   1981   1982
Defence budget     423    465    499    553    631    739    846
GNP               2829   2959   3115   3192   3187   3249   3166

The results of exercise 2 were (x = Defence budget):

n    x̄         sX        ȳ          sY         Σ xi yi     r       r²      a = b1   b = b0
7    593.71    154.10    3099.57    150.630    12982607    0.724   0.524   0.707    2679.603

a. Find a 90% confidence interval for the increase of the GNP per dollar per head increase of the Defence budget.
b. Use the interval of a. to answer the question whether H0 : β1 = 0 can be rejected versus H1 : β1 > 0 and state the correct
value of α.
c. Predict the GNP in a year where the Defence budget is 700 $ per head, using an interval at a confidence level of 95%.
d. Use SPSS to check the interval at a. and the test at b. and draw a scatter diagram including the fitted line. If possible,
also draw the confidence and prediction intervals.
Solutions
Exercise 1
a. Conditional distributions of the preference:

                  Male             Female
Bush              O = 245 (49%)    O = 185 (37%)
Ferraro           O = 155 (31%)    O = 235 (47%)
No preference     O = 100 (20%)    O = 80 (16%)
Total             500 (100%)       500 (100%)

The differences (male minus female) are +12%, −16% and +4%: they could be significant.
b. Two samples: n1 = 500 and n2 = 500.
c. Observed (O) and expected (E) values:

Preference        Male                Female              Total
Bush              O = 245, E = 215    O = 185, E = 215     430
Ferraro           O = 155, E = 195    O = 235, E = 195     390
No preference     O = 100, E = 90     O = 80, E = 90       180
Total             500                 500                 1000

Summary: X² = 27.01, p-value = P(χ²₂ ≥ 27.01) < 0.0005, with df = (2−1)(3−1) = 2 (or critical value c = 5.991 for this right-sided test).

Exercise 2
a. Conditional distributions of the type of ticket given the type of flight (the differences seem large):

Type of ticket    Continental        Intercontinental    Total
Business class    O = 29 (4.5%)      O = 22 (7.9%)         51
Comfort class     O = 95 (14.8%)     O = 121 (43.5%)      216
Economy class     O = 518 (80.7%)    O = 135 (48.6%)      653
Total             642 (100.0%)       278 (100.0%)         920

b. A one-sample approach: for every traveller the type of flight and the type of ticket are recorded, so the row and column
totals are stochastic.
c. Before applying the chi-square test for independence we first compute the row and column totals and the expected
values in case of independence. E.g. for the first cell: observed value O = 29 and
expected E = (row total × column total) / n = (642 × 51) / 920 = 35.6 (see table below).

                      Type of flight
Type of ticket    Continental           Intercontinental      Total
Business class    O = 29, E = 35.6      O = 22, E = 15.4        51
Comfort class     O = 95, E = 150.7     O = 121, E = 65.3      216
Economy class     O = 518, E = 455.7    O = 135, E = 197.3     653
Total             642                   278                    920

1. Is there a relation between type of flight and type of ticket?
2. We will apply the chi-square test on independence because we have a 3×2 cross table for two nominal variables.
3. We will test H0 : independence versus H1 : a relation between the types of flight and ticket, at α = 0.05.
4. Test statistic X² = Σ (O − E)² / E ~ χ²₂ ( df = (3−1)(2−1) = 2 ) if H0 is true.
5. Observed value X² = Σ (O − E)² / E = (29 − 35.6)² / 35.6 + ... + (135 − 197.3)² / 197.3 = 100.3
6. Right-sided test: if X² ≥ c, then reject H0; c = 5.99 from the χ²₂-table at α = 0.05.
7. In this case X² = 100.3 ≥ c => reject H0.
8. A relation between type of flight and type of ticket is proven at the 5% level.
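The chi-square computations of Exercises 1 and 2 can be checked with a short script. This is only a minimal sketch in Python (the exercises themselves assume a calculator or SPSS). The function name chi_square and the variable names are illustrative, and the critical value is copied from the χ²-table rather than computed. The script uses unrounded expected values, so the last digit can differ slightly from a hand calculation with rounded E.

```python
def chi_square(table):
    """Return (X2, df, expected) for a cross table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under H0: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for o_row, e_row in zip(table, expected)
             for o, e in zip(o_row, e_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df, expected

# Exercise 1: rows = male, female; columns = Bush, Ferraro, no preference
# (185 female Bush voters, as in the solution table, so each sample has n = 500)
poll = [[245, 155, 100], [185, 235, 80]]

# Exercise 2: rows = business, comfort, economy; columns = continental, intercontinental
tickets = [[29, 22], [95, 121], [518, 135]]

# Assumption: critical value for df = 2 and alpha = 0.05, read from the chi-square table
CRITICAL_5PCT_DF2 = 5.991

for name, table in [("Exercise 1", poll), ("Exercise 2", tickets)]:
    x2, df, _ = chi_square(table)
    print(f"{name}: X2 = {x2:.2f}, df = {df}, reject H0: {x2 >= CRITICAL_5PCT_DF2}")
```

Both tables have df = 2, so the same critical value applies to both tests.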
Exercise 3
a. X² = 218.4, p-value = P(χ² ≥ 218.4) < 0.0005 ( df = (2−1)(2−1) = 1 )

                 Colour
Pollen      Blue                  Red                  Total
Long        O = 296, E = 238.3    O = 27, E = 84.7      323
Round       O = 19, E = 76.7      O = 85, E = 27.3      104
Total       315                   112                   427

b. X² ≈ 219

Exercise 4
a. Plot: scatter diagram of y (vertical axis, 74.00 to 88.00) against x (horizontal axis, 300.0 to 600.0).
b. x̄ = 486.42, sx = 100.41, ȳ = 80.53, sy = 3.65, r = 0.199 (r² = 0.0395), b1 = 0.007224, b0 = 77.016
c. Σ (yi − ŷi)² ≈ 178.1, s² = 178.1/(n−2) = 13.700 => s = 3.701
d. 1. Does the oxygen tension (x) influence the blood flow (y) in the brain positively?
2. Assumptions: Yi = β0 + β1xi + εi, in which ε1, ..., ε15 are independent and N(0, σ²)-distributed.
3. Test H0 : β1 = 0 versus H1 : β1 > 0, α = 0.05
4. Test statistic T = B1 / (S / √Σ(xi − x̄)²), which has a t-distribution with n − 2 = 13 degrees of freedom if H0 is true.
5. Observed value of T: t = 0.007224 / (3.701 / 375.70) = 0.007224 / 0.009851 = 0.73,
where Σ(xi − x̄)² = 14 · sx² = 14 · (100.41)² = 141150.35 and √141150.35 = 375.70.
6. Right-sided test: if T ≥ c => reject H0. P(T13 ≥ c) = 0.05 => c = 1.771
7. t = 0.73 < c => do not reject H0
8. A positive linear relation between the oxygen tension and the blood flow in the brain is not proven at the 5% level.
e.
f. The (1−α)100%-CI(β0 + β1x*) has bounds: b0 + b1x* ± c · s · √(1/n + (x* − x̄)² / Σ(xi − x̄)²), where P(Tn−2 ≥ c) = α/2
The (1−α)100%-PI(Y*) has bounds: b0 + b1x* ± c · s · √(1 + 1/n + (x* − x̄)² / Σ(xi − x̄)²), where P(Tn−2 ≥ c) = α/2
In this case n = 15, x* = 600, α = 10%, c = 1.771 (t(13)-distribution, tail probability α/2 = 5%),
s = 3.701, Σ(xi − x̄)² = 141150.35 and b0 + b1x* = 77.016 + 0.007224 · 600 = 81.4
90%-CI(β0 + β1x*) = (79.4, 83.4)
90%-PI(Y*) = (74.6, 88.2)
We should use the last (very wide) interval because we want to predict one particular blood flow.
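The summary statistics and the t-test of Exercise 4 b.-d. can be reproduced from the raw data. The following is a minimal sketch in Python, not a substitute for the SPSS check asked for in e. The helper name simple_regression and the dictionary keys are illustrative, and the critical value 1.771 is taken from the t(13)-table as in the solution. Because the script works with unrounded intermediate values, results may differ from the hand calculation in the last digit.

```python
import math

# Data of Exercise 4: blood flow (y) and oxygen tension (x) for 15 patients
y = [84.33, 87.80, 82.20, 78.21, 78.44, 80.01, 83.53, 79.46,
     75.22, 76.58, 77.90, 78.80, 80.67, 86.60, 78.20]
x = [603.4, 582.5, 556.2, 594.6, 558.9, 575.2, 580.1, 451.2,
     404.0, 484.0, 452.4, 448.4, 334.8, 320.3, 350.3]

def simple_regression(x, y):
    """Least-squares fit y = b0 + b1*x with the t statistic for H0: beta1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r = sxy / math.sqrt(sxx * syy)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (n - 2))          # regression standard deviation
    t = b1 / (s / math.sqrt(sxx))         # observed value of the test statistic
    return {"n": n, "b0": b0, "b1": b1, "r": r, "s": s, "t": t,
            "sxx": sxx, "residual_sum": sum(residuals)}

fit = simple_regression(x, y)

# Assumption: one-sided critical value for alpha = 0.05 and 13 df, from the t-table
C_T13_5PCT = 1.771

print(f"b1 = {fit['b1']:.6f}, b0 = {fit['b0']:.3f}, r = {fit['r']:.3f}")
print(f"s = {fit['s']:.3f}, t = {fit['t']:.2f}, reject H0: {fit['t'] >= C_T13_5PCT}")
print(f"sum of residuals = {fit['residual_sum']:.6f}")
```

The printed sum of residuals should be zero up to floating-point error, which is the check requested in c.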
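Exercise 5
a. The 90%-CI(β1) has bounds: b1 ± c · s / √Σ(xi − x̄)², where P(T7−2 ≥ c) = α/2 = 0.05 : c = 2.015,
Σ(yi − ŷi)² = 64837 => s = √(64837/5) ≈ 113.9 and Σ(xi − x̄)² = 6 · sX² = 6 · (154.10)² ≈ 142481.
90%-CI(β1) = (0.10, 1.32)
b. 0 is not in the 90%-CI(β1) => H0 : β1 = 0 can be rejected versus the one-sided H1 : β1 > 0 at α = 5%.
c. The 95%-PI(Y*) has bounds: b0 + b1x* ± c · s · √(1 + 1/n + (x* − x̄)² / Σ(xi − x̄)²),
where n = 7, P(T7−2 ≥ c) = α/2 = 2.5% => c = 2.571,
b0 + b1x* = 2679.6 + 0.707 · 700 = 3174.5 and s = 113.9.
95%-PI(Y*) = 3174.5 ± 324 → (2851, 3498)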
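The intervals of Exercise 5 a. and c. follow from the standard formulas b1 ± c · s / √Σ(xi − x̄)² and b0 + b1x* ± c · s · √(1 + 1/n + (x* − x̄)² / Σ(xi − x̄)²). A minimal sketch in Python, starting from the raw data of the exercise. All variable names are illustrative, and the critical values 2.015 and 2.571 are copied from the t(5)-table rather than computed.

```python
import math

# Data of Exercise 5: Defence budget ($ per head) and GNP (billions of $), 1976-1982
defence = [423, 465, 499, 553, 631, 739, 846]
gnp = [2829, 2959, 3115, 3192, 3187, 3249, 3166]

n = len(defence)
x_bar = sum(defence) / n
y_bar = sum(gnp) / n
sxx = sum((xi - x_bar) ** 2 for xi in defence)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(defence, gnp))

b1 = sxy / sxx                      # estimated slope
b0 = y_bar - b1 * x_bar             # estimated intercept
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(defence, gnp))
s = math.sqrt(sse / (n - 2))        # regression standard deviation

# Assumption: t(5) critical values from the table (tail probabilities 5% and 2.5%)
C_90 = 2.015
C_95 = 2.571

# a. 90% confidence interval for the slope beta1
half_width_slope = C_90 * s / math.sqrt(sxx)
ci_slope = (b1 - half_width_slope, b1 + half_width_slope)

# c. 95% prediction interval for the GNP at a Defence budget of 700 $ per head
x_star = 700
prediction = b0 + b1 * x_star
half_width_pi = C_95 * s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
pi = (prediction - half_width_pi, prediction + half_width_pi)

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, s = {s:.1f}")
print(f"90%-CI(beta1) = ({ci_slope[0]:.2f}, {ci_slope[1]:.2f})")
print(f"95%-PI at x* = {x_star}: {prediction:.1f} +/- {half_width_pi:.0f}")
```

For b., the one-sided test at α = 5% rejects H0 exactly when the lower bound of the 90% interval for the slope is above 0.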