Sample Final

BSTT 401 Sample Final Exam Spring 2002
1. Suppose that a random sample of five active members in each of four political parties
in a certain country was given a questionnaire purported to measure (on a 100-point
scale) the extent of “general authoritarian attitude toward interpersonal relationships.”
The means and standard deviations of the authoritarianism scores for each party are given
in the following table. (20 pts)
        Party 1   Party 2   Party 3   Party 4
ybar      85        80        90        70
sd         6         7        14        10
n          5         5         5         5
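data summary;
input prty $ n ybar sd;
/* Build three pseudo-observations per party that reproduce its mean and sd */
delta = sqrt((n-1)/2)*sd;
weight = n - 2;
y = ybar; output;          /* n-2 copies at the mean */
weight = 1;
y = ybar - delta; output;  /* one point below the mean */
y = ybar + delta; output;  /* one point above the mean */
lines;
a 5 85 6
b 5 80 7
c 5 90 14
d 5 70 10
run;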
proc glm;
class prty;
freq weight;
model y = prty;
means prty / tukey cldiff;
run;
OUTPUT
The GLM Procedure
Class Level Information
Class    Levels    Values
prty     4         a b c d
Number of observations    20
The GLM Procedure
Dependent Variable: y
Frequency: weight
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       1093.750000     364.583333       3.83    0.0306
Error              16       1524.000000      95.250000
Corrected Total    19       2617.750000
R-Square    Coeff Var    Root MSE    y Mean
0.417821    12.01183     9.759611    81.25000
Source    DF    Type I SS      Mean Square    F Value    Pr > F
prty       3    1093.750000    364.583333        3.83    0.0306
Source    DF    Type III SS    Mean Square    F Value    Pr > F
prty       3    1093.750000    364.583333        3.83    0.0306
The GLM Procedure
Tukey's Studentized Range (HSD) Test for y
NOTE: This test controls the Type I experimentwise error rate.
Alpha                                   0.05
Error Degrees of Freedom                16
Error Mean Square                       95.25
Critical Value of Studentized Range     4.04609
Minimum Significant Difference          17.66
Comparisons significant at the 0.05 level are indicated by ***.
prty          Difference        Simultaneous 95%
Comparison    Between Means     Confidence Limits
c - a             5.000         -12.660    22.660
c - b            10.000          -7.660    27.660
c - d            20.000           2.340    37.660   ***
a - c            -5.000         -22.660    12.660
a - b             5.000         -12.660    22.660
a - d            15.000          -2.660    32.660
b - c           -10.000         -27.660     7.660
b - a            -5.000         -22.660    12.660
b - d            10.000          -7.660    27.660
d - c           -20.000         -37.660    -2.340   ***
d - a           -15.000         -32.660     2.660
d - b           -10.000         -27.660     7.660
a. State the model for the experiment, using dummy variables.
b. State the hypotheses.
c. Test to see where significant differences exist among parties with
respect to mean authoritarian scores.
d. Based on Tukey’s method of multiple comparisons, identify the pairs in which the
means significantly differ from one another. (Use alpha = .05)
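The Tukey decision in part d can be checked by hand from the printed output. The following is a minimal Python sketch (Python serves only as a calculator here and is not part of the exam's SAS program; the variable names are illustrative). It assumes equal group sizes, so the minimum significant difference is the critical value times sqrt(MSE/n), and it lists the pairs whose mean difference exceeds that value.

```python
import math
from itertools import combinations

# Summary values copied from the data table and the GLM output above.
means = {"a": 85, "b": 80, "c": 90, "d": 70}
n_per_group = 5
mse = 95.25        # Error Mean Square
q_crit = 4.04609   # Critical Value of Studentized Range (alpha = .05, 16 error df)

# Tukey's minimum significant difference for equal group sizes.
msd = q_crit * math.sqrt(mse / n_per_group)

# A pair differs significantly when its absolute mean difference exceeds the MSD.
significant = [
    (g1, g2)
    for g1, g2 in combinations(sorted(means), 2)
    if abs(means[g1] - means[g2]) > msd
]

print(round(msd, 2), significant)
```

The result can be compared with the Minimum Significant Difference line and the *** flags in the SAS listing.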
2. The diameters (Y) of three species of pine trees (A) were compared at each of four
locations (B), using samples of three trees per species at each location. Here A is
considered fixed and B random. The resulting data are given in the following
SAS program. (15 pts)
data mixed;
input A B @@;
do i = 1 to 3;
input yield @@;
output;
end;
drop i;
cards;
1 1 15.71 16.02 15.90
1 2 16.21 16.36 16.33
1 3 17.32 17.03 17.22
1 4 17.54 17.82 17.62
2 1 17.83 17.45 16.96
2 2 17.68 17.70 17.52
2 3 17.95 18.01 18.41
2 4 18.08 18.56 18.90
3 1 14.78 15.03 14.63
3 2 15.80 15.62 15.77
3 3 16.21 16.44 16.32
3 4 16.99 16.39 17.02
;
proc glm;
title 'when A is fixed and B is random';
class A B;
model yield = A B A*B;
random B A*B/test;
run;
Output
WHEN A IS FIXED AND B IS RANDOM
The GLM Procedure
Class Level Information
Class    Levels    Values
A        3         1 2 3
B        4         1 2 3 4
Number of observations    36
The GLM Procedure
Dependent Variable: yield
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              11       39.06083056     3.55098460      61.30    <.0001
Error              24        1.39026667     0.05792778
Corrected Total    35       40.45109722
R-Square    Coeff Var    Root MSE    yield Mean
0.965631    1.427132     0.240682    16.86472
Source    DF    Type I SS       Mean Square    F Value    Pr > F
A          2    24.31027222     12.15513611     209.83    <.0001
B          3    13.81794167      4.60598056      79.51    <.0001
A*B        6     0.93261667      0.15543611       2.68    0.0389
Source    DF    Type III SS     Mean Square    F Value    Pr > F
A          2    24.31027222     12.15513611     209.83    <.0001
B          3    13.81794167      4.60598056      79.51    <.0001
A*B        6     0.93261667      0.15543611       2.68    0.0389
The GLM Procedure
Source    Type III Expected Mean Square
A         Var(Error) + 3 Var(A*B) + Q(A)
B         Var(Error) + 3 Var(A*B) + 9 Var(B)
A*B       Var(Error) + 3 Var(A*B)
The GLM Procedure
Tests of Hypotheses for Mixed Model Analysis of Variance
Dependent Variable: yield
Source              DF    Type III SS    Mean Square    F Value    Pr > F
A                    2      24.310272      12.155136      78.20    <.0001
B                    3      13.817942       4.605981      29.63    0.0005
Error: MS(A*B)       6       0.932617       0.155436
Source              DF    Type III SS    Mean Square    F Value    Pr > F
A*B                  6       0.932617       0.155436       2.68    0.0389
Error: MS(Error)    24       1.390267       0.057928
a. State the model
b. State the three possible hypotheses.
c. Test to see where significant differences exist among species, locations as well as the
interactions with respect to mean diameters of the trees.
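The expected mean squares in the output indicate which denominator each F test uses: A and B are tested against MS(A*B), and A*B against MS(Error). The following Python sketch (a calculator-style check with illustrative names, not part of the exam's SAS program) recomputes the three F ratios from the printed mean squares.

```python
# Mean squares copied from the mixed-model tables above.
mean_squares = {"A": 12.155136, "B": 4.605981, "A*B": 0.155436, "Error": 0.057928}

# Under the printed expected mean squares, A and B share every variance term
# with A*B except the one being tested, so MS(A*B) is their denominator.
f_a = mean_squares["A"] / mean_squares["A*B"]
f_b = mean_squares["B"] / mean_squares["A*B"]

# The interaction is tested against the within-cell error.
f_ab = mean_squares["A*B"] / mean_squares["Error"]

print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))
```

These ratios should agree with the F values in the "Tests of Hypotheses for Mixed Model Analysis of Variance" section rather than with the fixed-effects Type III table.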
3. 11 students were asked about their attitude toward biostatistics before and after taking
BSTT401. (10 pts)
BSTT 401          Opinion about biostatistics
                  Like    Dislike    Row totals
Before              2        9          11
After               7        4          11
Column totals       9       13          22
Note that the number of students who answered the question is 11, not 22: each student
answered the question twice, once before and once after taking BSTT401. The observations
are therefore not independent. Three students said that they disliked biostatistics both
before and after taking the class.
a. Construct a table about the pairs of responses.
                        After
Before           Like    Dislike    Row totals
Like
Dislike
Column totals
b. Calculate McNemar’s test statistic and determine whether the students’ attitude toward
biostatistics changed significantly. (The critical value of χ² with one degree of freedom
at α = .05 is 3.841.)
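One way to approach parts a and b is to recover the paired cells from the margins and the single paired count given in the text, then apply McNemar's formula to the discordant pairs. The Python sketch below does this with illustrative variable names. It assumes the uncorrected statistic (b - c)²/(b + c); a course that uses the continuity-corrected form would subtract 1 from |b - c| before squaring.

```python
# Marginal counts from the table above, plus the one paired count stated in the text.
before_like, before_dislike = 2, 9
after_like, after_dislike = 7, 4
dislike_both = 3  # disliked biostatistics before and after

# Fill in the remaining paired cells from the margins.
dislike_to_like = before_dislike - dislike_both   # changed from dislike to like
like_to_dislike = after_dislike - dislike_both    # changed from like to dislike
like_both = before_like - like_to_dislike         # liked it both times

# McNemar's statistic uses only the discordant pairs (no continuity correction).
chi_sq = (dislike_to_like - like_to_dislike) ** 2 / (dislike_to_like + like_to_dislike)

print(like_both, like_to_dislike, dislike_to_like, dislike_both, round(chi_sq, 3))
```

The printed statistic is then compared with the critical value 3.841 quoted in part b.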
4. Suppose we want to find out the relationship between lung cancer and BMI and the
following data have been collected. (20 pts)
BMI                       Heart attacks
                          Yes     No
Above 30 (Obese)           25      5
25 – 30 (Overweight)       10     20
Below 25 (Normal)           5     25
SAS program
data cancer;
input bmi $ attack $ wt;
if bmi = 'obese' then X1 = 1; else X1 = 0;
if bmi = 'overwt' then X2 = 1; else X2 = 0;
if attack = 'yes' then Y = 1; else Y = 0;
lines;
obese  yes 25
obese  no   5
overwt yes 10
overwt no  20
normal yes  5
normal no  25
run;
proc logistic descending;
weight wt;
model Y = X1 X2 /link=logit;
run;
Output
The LOGISTIC Procedure
Model Information
Data Set                     WORK.CANCER
Response Variable            Y
Number of Response Levels    2
Number of Observations       6
Weight Variable              wt
Sum of Weights               90
Link Function                Logit
Optimization Technique       Fisher's scoring
Response Profile
Ordered Value    Y    Total Frequency    Total Weight
1                1    3                  40.000000
2                0    3                  50.000000
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          125.653           98.258
SC           125.445           97.633
-2 Log L     123.653           92.258
Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    31.3949       2     <.0001
Score               29.2500       2     <.0001
Wald                23.3637       2     <.0001
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Chi-Square    Pr > ChiSq
Intercept    1     -1.6093     0.4899            10.7921       0.0010
X1           1      3.2186     0.6928            21.5842       <.0001
X2           1      0.9162     0.6245             2.1523       0.1424
Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
X1        24.994            6.429     97.170
X2         2.500            0.735      8.500
Association of Predicted Probabilities and Observed Responses
Percent Concordant    33.3    Somers' D    0.000
Percent Discordant    33.3    Gamma        0.000
Percent Tied          33.3    Tau-a        0.000
Pairs                 9       c            0.500
a) State the model.
b) Based on the G² statistic, state whether the model is significant or not.
c) Find OR (Obese vs. Normal) and its confidence interval, and interpret the result.
d) Find OR (Overweight vs. Normal) and its confidence interval, and interpret the result.
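Parts b through d can be cross-checked against the listing by hand. The Python sketch below (a calculator-style check with illustrative names, not part of the exam's SAS program) exponentiates each coefficient and its Wald limits, assuming the usual 95% normal quantile of 1.96, and computes G² as the drop in -2 Log L between the two fitted models. Small differences from the SAS output in the last digit are expected because the printed estimates are rounded.

```python
import math

# Estimates and standard errors copied from the LOGISTIC output above.
estimates = {
    "X1 (obese vs normal)": (3.2186, 0.6928),
    "X2 (overweight vs normal)": (0.9162, 0.6245),
}
Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Return (OR, lower, upper) by exponentiating beta and beta +/- z*se."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


results = {name: odds_ratio_with_ci(b, se) for name, (b, se) in estimates.items()}

# G^2 is the drop in -2 Log L from the intercept-only model to the fitted model.
g2 = 123.653 - 92.258

for name, (oratio, lower, upper) in results.items():
    print(f"{name}: OR = {oratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print(f"G^2 = {g2:.3f} on 2 df")
```

An interval that contains 1 indicates that the corresponding odds ratio is not significantly different from 1 at the 5% level.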
5. The objective of this study is to compare incidence of nonmelanoma skin cancer
among women in Minneapolis-St. Paul and Dallas-Ft. Worth. (15 pts)
SAS program
data skin1;
input agegroup $ city $ cases pop @@;
lpop = log(pop);
if agegroup='15-24' then age1=1; else age1=0;
if agegroup='25-34' then age2=1; else age2=0;
if agegroup='35-44' then age3=1; else age3=0;
if agegroup='45-54' then age4=1; else age4=0;
if agegroup='55-64' then age5=1; else age5=0;
if agegroup='65-74' then age6=1; else age6=0;
if agegroup='75-84' then age7=1; else age7=0;
if city='Dallas' then city1=1; else city1=0;
lines;
15-24 Paul   1 172675    15-24 Dallas   4 181343
25-34 Paul  16 123065    25-34 Dallas  38 146207
35-44 Paul  30  96216    35-44 Dallas 119 121374
45-54 Paul  71  92051    45-54 Dallas 221 111353
55-64 Paul 102  72159    55-64 Dallas 259  83004
65-74 Paul 130  54722    65-74 Dallas 310  55932
75-84 Paul 133  32185    75-84 Dallas 226  29007
85+   Paul  40   8328    85+   Dallas  65   7538
run;
proc genmod;
model cases = age1 age2 age3 age4 age5 age6 age7 city1
/dist=poisson link = log
offset=lpop /* to normalize the fitted cell means */
type1 type3;
run;
Output
The GENMOD Procedure
Model Information
Data Set              WORK.SKIN2
Distribution          Poisson
Link Function         Log
Dependent Variable    cases
Offset Variable       lpop
Observations Used     16
Criteria For Assessing Goodness Of Fit
Criterion             DF    Value        Value/DF
Deviance              7     8.1950       1.1707
Scaled Deviance       7     8.1950       1.1707
Pearson Chi-Square    7     8.0626       1.1518
Scaled Pearson X2     7     8.0626       1.1518
Log Likelihood              7201.8635
Algorithm converged.
Using EXCEL, the p-value for D is .316.
Analysis Of Parameter Estimates
                           Standard    Wald 95% Confidence
Parameter  DF  Estimate    Error       Limits                ChiSquare   Pr > ChiSq
Intercept  1   -5.4797     0.1037      -5.6828   -5.2765     2794.67     <.0001
age1       1   -6.1782     0.4577      -7.0753   -5.2810      182.17     <.0001
age2       1   -3.5480     0.1675      -3.8763   -3.2197      448.76     <.0001
age3       1   -2.3308     0.1275      -2.5807   -2.0810      334.36     <.0001
age4       1   -1.5830     0.1138      -1.8061   -1.3599      193.38     <.0001
age5       1   -1.0909     0.1109      -1.3083   -0.8735       96.75     <.0001
age6       1   -0.5328     0.1086      -0.7457   -0.3199       24.06     <.0001
age7       1   -0.1196     0.1109      -0.3371    0.0978        1.16     0.2809
city1      1    0.8043     0.0522       0.7020    0.9066      237.34     <.0001
Scale      0    1.0000     0.0000       1.0000    1.0000
NOTE: The scale parameter was held fixed.
LR Statistics For Type 1 Analysis
Source       Deviance     DF    Chi-Square    Pr > ChiSq
Intercept    2790.3403
age1         1808.1517    1     982.19        <.0001
age2         1115.2415    1     692.91        <.0001
age3          708.1006    1     407.14        <.0001
age4          455.6933    1     252.41        <.0001
age5          306.8205    1     148.87        <.0001
age6          268.0675    1      38.75        <.0001
age7          266.9143    1       1.15        0.2829
city1           8.1950    1     258.72        <.0001
LR Statistics For Type 3 Analysis
Source    DF    Chi-Square    Pr > ChiSq
age1      1     626.74        <.0001
age2      1     418.90        <.0001
age3      1     252.07        <.0001
age4      1     145.04        <.0001
age5      1      78.00        <.0001
age6      1      21.50        <.0001
age7      1       1.14        0.2862
city1     1     258.72        <.0001
a) State the model
b) Based on the deviance statistic D, state whether the model fits the data well or not.
c) Find RR (Dallas vs. Paul) and interpret the result.
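For part c, under a log-link Poisson model with a log-population offset, the exponentiated city1 coefficient is the age-adjusted rate ratio. The Python sketch below (a calculator-style check with illustrative names, not part of the exam's SAS program) exponentiates the estimate and its Wald limits as printed in the parameter table.

```python
import math

# city1 row from "Analysis Of Parameter Estimates": estimate and Wald 95% limits.
beta_city, lower_limit, upper_limit = 0.8043, 0.7020, 0.9066

# Exponentiating a log-rate coefficient gives the rate ratio (Dallas vs Paul),
# holding age group fixed.
rate_ratio = math.exp(beta_city)
rr_lower = math.exp(lower_limit)
rr_upper = math.exp(upper_limit)

print(f"RR = {rate_ratio:.3f}, 95% CI = ({rr_lower:.3f}, {rr_upper:.3f})")
```

Because the whole interval lies above 1, the estimated incidence rate is higher in Dallas than in Minneapolis-St. Paul within each age group.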
6. In the same context as Problems 1 & 2, subjects 1-8 came from Clinic A and
subjects 9-15 from Clinic B; subjects 16-20, from Clinic C, were added
later. (10 pts)
Subject   Clinic A     Subject   Clinic B     Subject   Clinic C
1         1.69         9         2.44         16        2.34
2         2.22         10        4.17         17        3.17
3         3.07         11        2.42         18        4.42
4         3.35         12        2.94         19        4.94
5         3.00         13        3.04         20        5.04
6         2.74         14        4.62
7         3.61         15        4.42
8         5.14
title1 'Nonparametric problem 3';
data one;
input subject before after @@;
if subject < 9 then clinic="A";
if subject > 8 and subject < 16 then clinic="B";
if subject > 15 then clinic="C";
drop before;
cards;
1 1.69 1.69 9 2.58 2.44
2 2.77 2.22 10 1.84 4.17
3 1.00 3.07 11 1.89 2.42
4 1.66 3.35 12 1.91 2.94
5 3.00 3.00 13 1.75 3.04
6 0.85 2.74 14 2.46 4.62
7 1.42 3.61 15 2.35 4.42
8 2.82 5.14 16 . 2.34
17 . 3.17 18 . 4.42
19 . 4.94 20 . 5.04
;
run;
proc npar1way wilcoxon;
class clinic;
var after;
run;
Edited SAS output
The NPAR1WAY Procedure
Wilcoxon Scores (Rank Sums) for Variable after
Classified by Variable clinic
                  Sum of     Expected     Std Dev       Mean
clinic    N       Scores     Under H0     Under H0      Score
--------------------------------------------------------------
A         8       72.00      84.00        12.956608      9.000000
B         7       71.50      73.50        12.614684     10.214286
C         5       66.50      52.50        11.452131     13.300000
Average scores were used for ties.
Kruskal-Wallis Test
Chi-Square         1.6519
DF                 2
Pr > Chi-Square    0.4378
Can one conclude that the FEV levels from these three groups are different? Let
α = 0.05 and find the p-value.
7. Assume that the following table came from the analysis of a randomized-blocks
design ANOVA. (10 pts)
Source        df    SS    MS       F
Treatments    4     b     e        5.00
Blocks        a     c     48.00    6.00
Error         20    d     f
Show your work on how you got the answers for a) – f).
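The missing entries follow from the standard randomized-blocks identities: error df = (treatment df) × (block df), MS = SS/df, and F = MS/MS(Error). The Python sketch below is one possible order of solving for a) through f), with illustrative variable names; it is a check on hand work rather than a required method.

```python
# Known entries from the ANOVA table above.
df_treat, df_error = 4, 20
f_treat, f_block, ms_block = 5.00, 6.00, 48.00

df_block = df_error // df_treat   # a: error df = treatment df x block df
ms_error = ms_block / f_block     # f: F(Blocks) = MS(Blocks) / MS(Error)
ms_treat = f_treat * ms_error     # e: F(Treatments) = MS(Treatments) / MS(Error)
ss_treat = df_treat * ms_treat    # b: SS = df x MS
ss_block = df_block * ms_block    # c
ss_error = df_error * ms_error    # d

print(df_block, ss_treat, ss_block, ss_error, ms_treat, ms_error)
```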