EIPB 698D Lecture 5

Raul Cruz-Cano
Spring 2013
Midterm Comments
• PROC MEANS vs. PROC SURVEYMEANS
• For non-parametric: Kruskal-Wallis
Proc Reg
• The REG procedure is one of many regression procedures in
the SAS System.
PROC REG < options > ;
MODEL dependents=<regressors> < / options > ;
BY variables ;
OUTPUT < OUT=SAS-data-set > keyword=names ;
data blood;
INFILE 'F:\blood.txt';
INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;
data blood1; set blood;
if gender='Female' then sex=1; else sex=0;
if bloodtype='A' then typeA=1; else typeA=0;
if bloodtype='B' then typeB=1; else typeB=0;
if bloodtype='AB' then typeAB=1; else typeAB=0;
if age_group='Old' then Age_old=1; else Age_old=0;
run;
proc reg data =blood1;
model cholesterol =sex typeA typeB typeAB Age_old RBC WBC ;
run;
Proc reg output
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              7            41237    5891.02895      2.54   0.0140
Error            655          1521839    2323.41811
Corrected Total  662          1563076
DF - These are the degrees of freedom associated with the sources of variance.
(1) The total variance has N-1 degrees of freedom (663-1=662).
(2) The model degrees of freedom equal the number of estimated parameters
minus 1 (P-1). Counting the intercept, there are 8 parameters, so the model has
8-1=7 degrees of freedom.
(3) The residual (Error) degrees of freedom are the total DF minus the model DF:
662-7=655.
Sum of Squares - These are the sums of squares associated with the three sources of variance: total, model and residual.
SSTotal: The total variability around the mean, Sum(Y - Ybar)^2.
SSResidual: The sum of squared errors in prediction, Sum(Y - Ypredicted)^2.
SSModel: The improvement in prediction from using the predicted value of Y over just
using the mean of Y. Hence, this is the sum of squared differences between
the predicted value of Y and the mean of Y, Sum(Ypredicted - Ybar)^2.
Note that SSTotal = SSModel + SSResidual. SSModel / SSTotal is equal to the value of
R-Square, the proportion of the variance explained by the independent variables.
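As a quick check of the relationship above, here is a minimal sketch of a DATA step that recomputes R-Square and Adj R-Sq from the printed sums of squares. The variable names are made up for illustration, and the adjusted R-square formula used is the usual one for a model with an intercept; the results should agree with the 0.0264 and 0.0160 that PROC REG reports, up to rounding.

```sas
/* Recompute R-Square and Adj R-Sq from the ANOVA table (illustrative names). */
data _null_;
  ss_model = 41237;      /* Model sum of squares                     */
  ss_total = 1563076;    /* Corrected Total sum of squares           */
  n = 663;               /* observations = Corrected Total DF + 1    */
  p = 7;                 /* regressors, not counting the intercept   */
  rsq     = ss_model / ss_total;
  adj_rsq = 1 - (1 - rsq) * (n - 1) / (n - p - 1);
  put rsq= 6.4 adj_rsq= 6.4;   /* written to the SAS log */
run;
```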
Mean Square - These are the Mean Squares, the Sum of Squares divided by
their respective DF. These are computed so you can compute the F ratio,
dividing the Mean Square Model by the Mean Square Residual to test
the significance of the predictors in the model
F Value and Pr > F - The F-value is the Mean Square Model divided by the Mean
Square Residual. F-value and P value are used to answer the question "Do the
independent variables predict the dependent variable?". The p-value is
compared to your alpha level (typically 0.05) and, if smaller, you can conclude
"Yes, the independent variables reliably predict the dependent variable". Note
that this is an overall significance test assessing whether the group of
independent variables when used together reliably predict the dependent
variable, and does not address the ability of any of the particular independent
variables to predict the dependent variable.
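The F ratio and its p-value can be reproduced by hand. The sketch below assumes the mean squares and degrees of freedom printed in the ANOVA table; the variable names are illustrative only. PROBF returns the cumulative F distribution, so one minus it gives the upper-tail probability, which should come out close to the printed F Value and Pr > F.

```sas
/* Recompute the overall F test from the printed mean squares. */
data _null_;
  ms_model = 5891.02895;   /* Mean Square for Model */
  ms_error = 2323.41811;   /* Mean Square for Error */
  df_model = 7;
  df_error = 655;
  f = ms_model / ms_error;
  p = 1 - probf(f, df_model, df_error);   /* upper-tail probability */
  put f= 6.2 p= 6.4;
run;
```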
Proc reg output
Root MSE          48.20185   R-Square   0.0264
Dependent Mean   201.69683   Adj R-Sq   0.0160
Coeff Var         23.89817
Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the
Mean Square Residual (or Error).
Dependent Mean - This is the mean of the dependent variable.
R-Square - How much variability is explained by the model.
Coeff Var - This is the coefficient of variation, which is a unit-less measure of
variation in the data. It is the root MSE divided by the mean of the dependent variable,
multiplied by 100: 100*(48.2/201.69) = 23.90.
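The two derived quantities can be checked the same way. This is a small sketch using the Mean Square Error and Dependent Mean from the output; the variable names are illustrative.

```sas
/* Root MSE and coefficient of variation from the printed output. */
data _null_;
  ms_error = 2323.41811;   /* Mean Square for Error           */
  dep_mean = 201.69683;    /* mean of cholesterol             */
  root_mse  = sqrt(ms_error);
  coeff_var = 100 * root_mse / dep_mean;
  put root_mse= 9.5 coeff_var= 9.5;
run;
```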
Proc reg output
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            187.91927         17.45409     10.77     <.0001
sex          1              1.48640          3.79640      0.39     0.6955
typeA        1              0.74839          4.01841      0.19     0.8523
typeB        1             10.14482          6.97339      1.45     0.1462
typeAB       1            -19.90314         10.45833     -1.90     0.0575
Age_old      1            -11.61798          3.85823     -3.01     0.0027
RBC          1              0.00264          0.00191      1.38     0.1676
WBC          1              0.20512          1.88816      0.11     0.9135
t Value and Pr > |t| - These columns provide the t-value and two-tailed p-value used in testing
the null hypothesis that the coefficient/parameter is 0.
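To see where one row of the table comes from, the sketch below recomputes the t-value and two-tailed p-value for Age_old from its estimate and standard error, using the 655 error degrees of freedom. The variable names are illustrative; the result should be close to the printed -3.01 and 0.0027.

```sas
/* t test for a single coefficient (Age_old row of the table). */
data _null_;
  estimate = -11.61798;
  se       = 3.85823;
  df_error = 655;
  t = estimate / se;
  p = 2 * (1 - probt(abs(t), df_error));   /* two-tailed p-value */
  put t= 6.2 p= 6.4;
run;
```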
Another (better?) approach for weighted data
• Experimental design data have all the properties that we learned about in statistics
classes.
– The data are independent
– The observations are identically distributed with some known error distribution
– There is an underlying assumption that the data come to us as a finite number of
observations from a conceptually infinite population
– Simple random sampling without replacement for the sample data
• Sample survey data
– Come from a finite target population, not a conceptually infinite one
– Do not have independent errors
– May cover many small sub-populations, so we do not
expect the errors to be identically distributed
Household Component of the Medical
Expenditure Panel Survey (MEPS HC)
• The MEPS HC is a nationally representative survey of the U.S.
civilian noninstitutionalized population.
• It collects medical expenditure data as well as information on
demographic characteristics, access to health care, health
insurance coverage, income, and employment.
• MEPS is cosponsored by the Agency for Healthcare Research
and Quality (AHRQ) and the National Center for Health
Statistics (NCHS).
• For the comparisons reported here we used the MEPS 2005
Full Year Consolidated Data File (HC-097).
• This is a public use file available for download from the MEPS
web site (http://www.meps.ahrq.gov).
Transforming from SAS transport (SSP) format to
SAS Dataset (SAS7BDAT)
• The MEPS is not a simple random sample; its design includes:
– Stratification
– Clustering
– Multiple stages of selection
– Disproportionate sampling
• The MEPS public use files (such as HC-097) include variables for generating weighted national estimates
and for use of the Taylor method for variance estimation. These variables are:
– person-level weight (PERWT05F on HC-097)
– stratum (VARSTR on HC-097)
– cluster/PSU (VARPSU on HC-097)
The stratum and cluster/PSU variables are needed for even better estimates of the CI.
LIBNAME PUFLIB 'C:\';
FILENAME IN1 'C:\H97.SSP';
PROC XCOPY IN=IN1 OUT=PUFLIB IMPORT;
RUN;
H97.SAS7BDAT occupies 408MB vs. 257MB for H97.SSP vs. 14MB for H97.ZIP
PROC SURVEYREG Simple Example
Predict total income (TTLP05X) based on sex (SEX), using the SAS7BDAT file in the PUFLIB library:
PROC SURVEYREG DATA=PUFLIB.H97;
strata VARSTR;
cluster VARPSU;
model TTLP05X = SEX;
weight PERWT05F;
RUN;
Logistic regression
• For binary response models, the response, Y, of an individual
or an experimental unit can take on one of two possible
values, denoted for convenience by 1 and 0 (for example, Y=1
if a disease is present, otherwise Y=0). Suppose x is a vector of
explanatory variables and p = P(Y=1 | x) is the response probability to be
modeled. The logistic regression model has the form
logit(p) = log(p / (1 - p)) = β0 + β1x
Proc logistic
The following statements are available in PROC LOGISTIC:
PROC LOGISTIC < options >;
BY variables ;
CLASS variable ;
MODEL response = < effects > < / options >;
MODEL events/trials = < effects > < / options >;
OUTPUT < OUT=SAS-data-set >
< keyword=name...keyword=name > / < option >;
The PROC LOGISTIC and MODEL statements are required;
only one MODEL statement can be specified. The CLASS
statement (if used) must precede the MODEL statement.
High school data
• The data were collected on 200 high school students, with
measurements on various tests, including science, math,
reading and social studies.
• The response variable is high writing test score (high_write),
where a writing score greater than or equal to 60 is
considered high, and less than 60 is considered low.
• We explore its relationship with gender, reading
test score (read), and science test score (science).
High school data
data new ;
set d.hsb2;
if write>=60 then high_write=1; else high_write=0;
keep ID female math read science write high_write;
run;
proc logistic data= new descending;
model high_write = female read science;
run;
Logistic output
Model Information
Data Set                      WORK.NEW
Response Variable             high_write
Number of Response Levels     2
Model                         binary logit
Optimization Technique        Fisher's scoring
Number of Observations Read   200
Number of Observations Used   200
Data Set - This is the data set used in this procedure.
Model - This is the type of regression model that was fit to our data. The terms
logit and logistic are interchangeable.
Logistic output
Response Profile
Ordered Value   high_write   Total Frequency
1               1             53
2               0            147
Probability modeled is high_write=1.
Probability modeled - This is a note informing which level of the response
variable we are modeling.
Ordered Value - Ordered value refers to how SAS orders the levels of the dependent
variable. Because we specified the descending option, SAS treats the levels in
descending order (high to low), so that when the regression coefficients are
estimated, a positive coefficient corresponds to a positive relationship with
high write status. By default, SAS models the lower level.
Logistic output
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Convergence Status - This describes whether the maximum likelihood
algorithm has converged or not, and what kind of convergence criterion is used
to assess convergence.
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC                233.289                    168.236
SC                 236.587                    181.430
-2 Log L           231.289                    160.236
Intercept Only - The model with no predictors, just the intercept term.
Intercept and Covariates - The fitted model.
Model Fit Statistics - These are various measurements used to assess the model fit.
The smaller the value, the better the fit.
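The AIC and SC columns can be rebuilt from -2 Log L. The sketch below assumes the standard definitions (AIC adds 2 per estimated parameter, SC adds log(n) per estimated parameter) and uses illustrative variable names; because -2 Log L is printed to three decimals, the last digit may differ slightly from the SAS output.

```sas
/* AIC and SC for the fitted model, from the printed -2 Log L. */
data _null_;
  n        = 200;       /* observations used                         */
  neg2logl = 160.236;   /* -2 Log L, intercept and covariates        */
  k        = 4;         /* parameters: intercept, female, read, science */
  aic = neg2logl + 2 * k;
  sc  = neg2logl + k * log(n);   /* log() is the natural logarithm */
  put aic= 8.3 sc= 8.3;
run;
```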
Logistic output
Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      71.0525    3       <.0001
Score                 58.6092    3       <.0001
Wald                  39.8751    3       <.0001
These are three asymptotically equivalent Chi-Square tests. They
test against the null hypothesis that all of the predictors'
regression coefficients are equal to zero in the model. With
p < 0.0001, we reject H0 and conclude that at least one of the
predictors' regression coefficients is not equal to zero.
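The Likelihood Ratio statistic is the drop in -2 Log L between the intercept-only model and the fitted model. A minimal sketch, using the values from the Model Fit Statistics table and illustrative variable names; SDF returns the upper-tail probability directly, which is more accurate than one minus PROBCHI for very small p-values.

```sas
/* Likelihood ratio test from the two -2 Log L values. */
data _null_;
  neg2logl_null = 231.289;   /* intercept only            */
  neg2logl_full = 160.236;   /* intercept and covariates  */
  df = 3;                    /* female, read, science     */
  chisq = neg2logl_null - neg2logl_full;
  p = sdf('CHISQUARE', chisq, df);   /* upper-tail probability */
  put chisq= 8.3 p= e10.;
run;
```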
Logistic output
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1   -12.7772           1.9759           41.8176       <.0001
female       1     1.4825           0.4474           10.9799       0.0009
read         1     0.1035           0.0258           16.1467       <.0001
science      1     0.0948           0.0305            9.6883       0.0019
Here are the parameter estimates along with their p-values. Based on the estimates,
our model is log[ p / (1-p) ] = -12.78 + 1.48*female + 0.10*read + 0.09*science.
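To make the fitted equation concrete, the sketch below converts the linear predictor into a predicted probability for one student. The student profile (female, read = 60, science = 60) is hypothetical and chosen only for illustration; the coefficients are the unrounded estimates from the table.

```sas
/* Predicted probability of a high writing score for a hypothetical student. */
data _null_;
  female  = 1;    /* hypothetical values, not taken from the data set */
  read    = 60;
  science = 60;
  xbeta = -12.7772 + 1.4825*female + 0.1035*read + 0.0948*science;
  p = 1 / (1 + exp(-xbeta));   /* inverse of the logit */
  put xbeta= 8.4 p= 6.4;
run;
```

For this profile the predicted probability comes out at roughly 0.65.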
Logistic output
Odds Ratio Estimates
Effect    Point Estimate   95% Wald Confidence Limits
female             4.404        1.832       10.584
read               1.109        1.054        1.167
science            1.099        1.036        1.167
The odds ratio is obtained by exponentiating the Estimate, exp[Estimate]. We can
interpret the odds ratio as follows: for a one unit increase in the predictor variable, the
odds of a positive outcome are expected to be multiplied by the respective odds ratio,
given the other variables in the model are held constant.
If the 95% CI does not cover 1, it suggests the estimate is statistically significant.
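The odds ratio and its Wald limits for female can be recomputed from the estimate and standard error in the maximum likelihood table. This is a sketch with illustrative variable names; PROBIT(0.975) supplies the 1.96 multiplier, and the results should be close to 4.404, 1.832 and 10.584.

```sas
/* Odds ratio and 95% Wald confidence limits for female. */
data _null_;
  estimate = 1.4825;
  se       = 0.4474;
  z = probit(0.975);                /* about 1.96 */
  odds_ratio = exp(estimate);
  lower = exp(estimate - z * se);
  upper = exp(estimate + z * se);
  put odds_ratio= 8.3 lower= 8.3 upper= 8.3;
run;
```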
Weighted Example
• Just as with linear regression, logistic regression allows you to look at the
effect of multiple predictors on an outcome.
• Consider the following example: 15- and 16-year-old adolescents were asked
if they have ever had sexual intercourse.
– The outcome of interest is intercourse.
– The predictors are race (white and black) and gender (male and female).
Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.
Here is a table of the data:
                   Intercourse
Race    Gender     Yes     No
White   Male        43    134
White   Female      26    149
Black   Male        29     23
Black   Female      22     36
Raul Cruz-Cano, HLTH653 Spring 2013
Data Set Intercourse
DATA intercourse;
INPUT white male intercourse count;
DATALINES;
1 1 1 43
1 1 0 134
1 0 1 26
1 0 0 149
0 1 1 29
0 1 0 23
0 0 1 22
0 0 0 36
;
RUN;
SAS:
PROC LOGISTIC DATA = intercourse descending;
weight count;
MODEL intercourse = white male/rsquare lackfit;
RUN;
• “descending” models the probability that intercourse = 1 (yes)
rather than = 0 (no).
• “rsquare” requests the R2 value from SAS; it is interpreted the
same way as the R2 from linear regression.
• “lackfit” requests the Hosmer and Lemeshow Goodness-of-Fit
Test. This tells you if the model you have created is a good fit
for the data.
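PROC LOGISTIC also accepts a FREQ statement, which treats count as the number of times each row occurs rather than as a weight. The sketch below is an alternative to try, not the code used on these slides, and it assumes the intercourse data set created above. The parameter estimates should match the WEIGHT version, but statistics that depend on the number of observations may differ, so it is worth comparing the two outputs.

```sas
/* Same model, with the cell counts supplied as frequencies instead of weights. */
PROC LOGISTIC DATA = intercourse descending;
  freq count;   /* each row stands for "count" observations */
  MODEL intercourse = white male / rsquare lackfit;
RUN;
```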
SAS Output: R2
Interpreting the R2 value
The R2 value is 0.9907. This means that 99.07%
of the variability in our outcome (intercourse)
is explained by including gender and race in
our model.
PROC LOGISTIC Output
The odds of having had intercourse are 1.911 times as high
for males as for females.
Hosmer and Lemeshow GOF Test
H-L GOF Test
The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses:
Ho: the model is a good fit, vs. Ha: the model is NOT a good fit
With this test, we want to FAIL to reject the null hypothesis, because that means our
model is a good fit (this is different from most of the hypothesis testing you have
seen).
Look for a p-value > 0.10 in the H-L GOF test. This indicates the model is a good fit.
In this case, the p-value = 0.2419, so we do NOT reject the null hypothesis, and we
conclude the model is a good fit.
Model Selection in SAS
• Can be applied to both Linear and Logistic Models
• Often, if you have multiple predictors and interactions in your model, SAS
can systematically select significant predictors using forward selection,
backward selection, or stepwise selection.
• In forward selection, SAS starts with no predictors in the model. It then
selects the predictor with the smallest p-value and adds it to the model. It
then selects another predictor from the remaining variables with the
smallest p-value and adds it to the model. It continues doing this until no
remaining predictor has a p-value less than 0.05.
• In backward selection, SAS starts with all of the predictors in the model
and eliminates the non-significant predictors one at a time, refitting the
model between each elimination. It stops once all the predictors
remaining in the model are statistically significant.
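Backward selection can be requested the same way as forward selection. The sketch below assumes the intercourse data set used elsewhere in this lecture; SLSTAY sets the significance level a predictor needs in order to stay in the model, and the 0.05 shown here is written out only to make the cutoff explicit.

```sas
/* Backward elimination: start from the full model and drop non-significant terms. */
PROC LOGISTIC DATA = intercourse descending;
  weight count;
  MODEL intercourse = white male white*male / selection = backward slstay = 0.05;
RUN;
```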
Forward Selection in SAS
We will let SAS select a model for us out of the three
predictors: white, male, white*male. Type the
following code into SAS:
PROC LOGISTIC DATA = intercourse descending;
weight count;
MODEL intercourse = white male white*male/selection = forward lackfit;
RUN;
Output from Forward Selection: “white” is added to
the model
“male” is added to the model
No more predictors are found to be statistically significant
The Final Model:
Hosmer and Lemeshow GOF Test: The model is a good fit
SAS Weighted vs. Survey Procedures
• A random sample: 300 students from each of the freshman, sophomore, junior, and senior classes.
proc format;
value Design 1='A' 2='B' 3='C';
value Rating 1='dislike very much'
2='dislike'
3='neutral'
4='like'
5='like very much';
value Class 1='Freshman' 2='Sophomore'
3='Junior' 4='Senior';
run;
data Enrollment;
format Class Class.;
input Class _TOTAL_;
datalines;
1 3734
2 3565
3 3903
4 4196
;
run;
data WebSurvey;
format Class Class. Design Design. Rating Rating. ;
do Class=1 to 4;
do Design=1 to 3;
do Rating=1 to 5;
input Count @@;
output;
end;
end;
end;
datalines;
10 34 35 16 15 8 21 23 26 22 5 10 24 30 21
1 14 25 23 37 11 14 20 34 21 16 19 30 23 12
19 12 26 18 25 11 14 24 33 18 10 18 32 23 17
8 15 35 30 12 15 22 34 9 20 2 34 30 18 16
;
run;
data WebSurvey;
set WebSurvey;
if Class=1 then Weight=3734/300;
if Class=2 then Weight=3565/300;
if Class=3 then Weight=3903/300;
if Class=4 then Weight=4196/300;
run;
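Before fitting any model, it can help to confirm that the weights scale the sample back up to the enrollment totals. The sketch below assumes the WebSurvey data set built above: with a WEIGHT statement, the SUM statistic in PROC MEANS is the weighted sum of Count, so each class should come out at its _TOTAL_ value from the Enrollment data set (3734, 3565, 3903 and 4196).

```sas
/* Weighted count per class; should reproduce the enrollment totals. */
proc means data=WebSurvey sum;
  class Class;
  var Count;
  weight Weight;
run;
```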
PROC Logistic
proc logistic data=WebSurvey;
freq Count;
class Design;
model Rating (ref='neutral') = Design ;
weight Weight;
run;
PROC surveylogistic
If you want “better” results..
proc surveylogistic data=WebSurvey total=Enrollment;
freq Count;
class Design;
model Rating (ref='neutral') = Design;
stratum Class;
weight Weight;
run;
For the Ratings for Design B vs. Design C compare
1. The point estimate
2. 95% Confidence Interval