EIPB 698D Lecture 5
Raul Cruz-Cano
Spring 2013

Midterm Comments
• PROC MEANS vs. PROC SURVEYMEANS
• For non-parametric: Kruskal-Wallis

Proc Reg
• The REG procedure is one of many regression procedures in the SAS System.

    PROC REG < options > ;
        MODEL dependents=<regressors> < / options > ;
        BY variables ;
        OUTPUT < OUT=SAS-data-set > keyword=names ;

    /* Read the raw blood data */
    data blood;
        INFILE 'F:\blood.txt';
        INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
    run;

    /* Create 0/1 indicator variables for the categorical predictors */
    data blood1;
        set blood;
        if gender='Female' then sex=1; else sex=0;
        /* the blood type without an indicator is the reference level */
        if bloodtype='A' then typeA=1; else typeA=0;
        if bloodtype='B' then typeB=1; else typeB=0;
        if bloodtype='AB' then typeAB=1; else typeAB=0;
        if age_group='Old' then Age_old=1; else Age_old=0;
    run;

    proc reg data=blood1;
        model cholesterol = sex typeA typeB typeAB Age_old RBC WBC;
    run;

Proc reg output

Analysis of Variance

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               7            41237    5891.02895      2.54   0.0140
    Error             655          1521839    2323.41811
    Corrected Total   662          1563076

DF: These are the degrees of freedom associated with the sources of variance.
(1) The total variance has N-1 degrees of freedom (663-1=662).
(2) The model degrees of freedom equal the number of estimated parameters minus 1 (P-1). Including the intercept, there are 8 parameters, so the model has 8-1=7 degrees of freedom.
(3) The residual (error) degrees of freedom are the total DF minus the model DF: 662-7=655.

Sum of Squares: These are associated with the three sources of variance: total, model and residual.
SSTotal: The total variability around the mean, Sum(Y - Ybar)^2.
SSResidual: The sum of squared errors in prediction, Sum(Y - Ypredicted)^2.
SSModel: The improvement in prediction from using the predicted value of Y rather than just the mean of Y.
Hence, this is the sum of squared differences between the predicted value of Y and the mean of Y, Sum(Ypredicted - Ybar)^2.
Note that SSTotal = SSModel + SSResidual. SSModel / SSTotal equals R-Square, the proportion of the variance explained by the independent variables.

Mean Square: These are the Sums of Squares divided by their respective DF. They are computed so that you can form the F ratio, dividing the Mean Square Model by the Mean Square Residual, to test the significance of the predictors in the model.

F Value and Pr > F: The F value is the Mean Square Model divided by the Mean Square Residual. The F value and its p-value are used to answer the question "Do the independent variables predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if it is smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". Note that this is an overall significance test: it assesses whether the group of independent variables, used together, reliably predicts the dependent variable. It does not address the ability of any particular independent variable to predict the dependent variable.

Proc reg output

    Root MSE          48.20185   R-Square   0.0264
    Dependent Mean   201.69683   Adj R-Sq   0.0160
    Coeff Var         23.89817

Root MSE: Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).

Dependent Mean: This is the mean of the dependent variable.
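The arithmetic behind the Analysis of Variance table and Root MSE can be checked by hand. The following is a minimal sketch that plugs in the sums of squares and degrees of freedom reported above; the data set name anova_check and the variable names are illustrative and not part of the lecture.

```sas
/* Recompute the PROC REG summary statistics from the ANOVA table. */
data anova_check;                       /* hypothetical data set name */
    ss_model = 41237;    df_model = 7;
    ss_error = 1521839;  df_error = 655;
    ss_total = ss_model + ss_error;     /* SSTotal = SSModel + SSResidual */

    ms_model = ss_model / df_model;     /* Mean Square Model */
    ms_error = ss_error / df_error;     /* Mean Square Residual (Error) */
    f_value  = ms_model / ms_error;     /* F ratio */
    p_value  = 1 - probf(f_value, df_model, df_error);  /* upper tail of F */
    r_square = ss_model / ss_total;     /* proportion of variance explained */
    root_mse = sqrt(ms_error);          /* SD of the error term */
run;

proc print data=anova_check noobs;
    var ms_model ms_error f_value p_value r_square root_mse;
run;
```

Because the printed sums of squares are rounded, the results should agree with the PROC REG output only up to rounding.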
R-Square (0.0264) and Adj R-Sq (0.0160): How much variability is explained by the model.

Coeff Var: This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the Root MSE divided by the mean of the dependent variable, multiplied by 100: 100 * (48.20 / 201.70) = 23.90.

Proc reg output

Parameter Estimates

    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1            187.91927         17.45409     10.77     <.0001
    sex          1              1.48640          3.79640      0.39     0.6955
    typeA        1              0.74839          4.01841      0.19     0.8523
    typeB        1             10.14482          6.97339      1.45     0.1462
    typeAB       1            -19.90314         10.45833     -1.90     0.0575
    Age_old      1            -11.61798          3.85823     -3.01     0.0027
    RBC          1              0.00264          0.00191      1.38     0.1676
    WBC          1              0.20512          1.88816      0.11     0.9135

t Value and Pr > |t|: These columns provide the t value and two-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0.

Another (better?) approach for weighted data
• Experimental design data have all the properties that we learned about in statistics classes.
  – The data are independent
  – Identically distributed observations with some known error distribution
  – There is an underlying assumption that the data come to us as a finite number of observations from a conceptually infinite population
  – Simple random sampling without replacement for the sample data
• Sample survey data:
  – Come from a finite target population
  – Do not have independent errors
  – Do not come from a conceptually infinite population
  – May cover many small sub-populations, so we do not expect the errors to be identically distributed

Household Component of the Medical Expenditure Panel Survey (MEPS HC)
• The MEPS HC is a nationally representative survey of the U.S. civilian noninstitutionalized population.
• It collects medical expenditure data as well as information on demographic characteristics, access to health care, health insurance coverage, income and employment.
• MEPS is cosponsored by the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS).
• For the comparisons reported here we used the MEPS 2005 Full Year Consolidated Data File (HC-097).
• This is a public use file available for download from the MEPS web site (http://www.meps.ahrq.gov).
• The MEPS is not a simple random sample; its design includes:
  – Stratification
  – Clustering
  – Multiple stages of selection
  – Disproportionate sampling
• The MEPS public use files (such as HC-097) include variables for generating weighted national estimates and for use of the Taylor method for variance estimation. These variables are:
  – person-level weight (PERWT05F on HC-097)
  – stratum (VARSTR on HC-097)
  – cluster/PSU (VARPSU on HC-097)
  (stratum and cluster/PSU are needed for even better estimates of the CI)

Transforming from SAS transport (SSP) format to SAS Dataset (SAS7BDAT)

    LIBNAME PUFLIB 'C:\';
    FILENAME IN1 'C:\H97.SSP';
    PROC XCOPY IN=IN1 OUT=PUFLIB IMPORT;
    RUN;

H97.SAS7BDAT occupies 408MB vs. 257MB for H97.SSP vs. 14MB for H97.ZIP

PROC SURVEYREG Simple Example

    /* mylib must point to the folder that holds the H97 SAS7BDAT data set */
    PROC SURVEYREG DATA=mylib.H97;
        strata VARSTR;
        cluster VARPSU;
        model TTLP05X = SEX;
        weight PERWT05F;
    Run;

Predict Total Income Based on Sex

Logistic regression
• For binary response models, the response, Y, of an individual or an experimental unit can take on one of two possible values, denoted for convenience by 1 and 0 (for example, Y=1 if a disease is present, otherwise Y=0). Suppose x is a vector of explanatory variables and P(Y=1) is the response probability to be modeled.
The logistic regression model has the form

    Logit(P(Y=1)) = log( P(Y=1) / (1 - P(Y=1)) ) = β0 + β1x

Proc logistic
The following statements are available in PROC LOGISTIC:

    PROC LOGISTIC < options >;
        BY variables ;
        CLASS variable ;
        MODEL response = < effects > < / options >;
        MODEL events/trials = < effects > < / options >;
        OUTPUT < OUT=SAS-data-set > < keyword=name...keyword=name > / < option >;

The PROC LOGISTIC and MODEL statements are required; only one MODEL statement can be specified. The CLASS statement (if used) must precede the MODEL statement.

High school data
• The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies.
• The response variable is high writing test score (high_write), where a writing score greater than or equal to 60 is considered high, and a score less than 60 is considered low.
• We explore its relationship with gender (female), reading test score (read), and science test score (science).

    data new;
        set d.hsb2;
        /* 1 = high writing score (60 or above), 0 = low */
        if write>=60 then high_write=1; else high_write=0;
        keep ID female math read science write high_write;
    run;

    proc logistic data=new descending;
        model high_write = female read science;
    run;

Logistic output

Model Information

    Data Set                      WORK.NEW
    Response Variable             high_write
    Number of Response Levels     2
    Model                         binary logit
    Optimization Technique        Fisher's scoring

    Number of Observations Read   200
    Number of Observations Used   200

Data Set: This is the data set used in this procedure.
Model: This is the type of regression model that was fit to our data. The terms logit and logistic are interchangeable.

Logistic output

Response Profile

    Ordered Value   high_write   Total Frequency
    1               1             53
    2               0            147

    Probability modeled is high_write=1.

"Probability modeled is high_write=1" is a note informing us which level of the response variable we are modeling. Ordered value refers to how SAS models the levels of the dependent variable.
When we specify the descending option, SAS treats the levels in descending order (high to low), so that when the regression coefficients are estimated, a positive coefficient corresponds to a positive relationship with high writing status. By default SAS models the lower level.

Logistic output

Model Convergence Status

    Convergence criterion (GCONV=1E-8) satisfied.

This describes whether the maximum likelihood algorithm has converged or not, and what kind of convergence criterion is used to assess convergence.

Model Fit Statistics

    Criterion   Intercept Only   Intercept and Covariates
    AIC                233.289                    168.236
    SC                 236.587                    181.430
    -2 Log L           231.289                    160.236

These are various measurements used to assess the model fit. Smaller values indicate a better fit.
Intercept Only: the model with no predictors, just the intercept term.
Intercept and Covariates: the fitted model.

Logistic output

Testing Global Null Hypothesis: BETA=0

    Test               Chi-Square   DF   Pr > ChiSq
    Likelihood Ratio      71.0525    3       <.0001
    Score                 58.6092    3       <.0001
    Wald                  39.8751    3       <.0001

These are three asymptotically equivalent Chi-Square tests. They test the null hypothesis that all of the predictors' regression coefficients in the model are equal to zero. With p < 0.0001, we reject Ho and conclude that at least one of the predictors' regression coefficients is not equal to zero.

Logistic output

Analysis of Maximum Likelihood Estimates

    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept    1   -12.7772           1.9759           41.8176       <.0001
    female       1     1.4825           0.4474           10.9799       0.0009
    read         1     0.1035           0.0258           16.1467       <.0001
    science      1     0.0948           0.0305            9.6883       0.0019

Here are the parameter estimates along with their p-values. Based on the estimates, our model is log[ p / (1-p) ] = -12.78 + 1.48*female + 0.10*read + 0.09*science.

Logistic output

Odds Ratio Estimates

    Effect    Point Estimate   95% Wald Confidence Limits
    female             4.404        1.832        10.584
    read               1.109        1.054         1.167
    science            1.099        1.036         1.167

The odds ratio is obtained by exponentiating the Estimate, exp[Estimate].
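The odds ratio table can be reproduced from the maximum likelihood estimates. The sketch below assumes the 95% Wald limits are computed in the standard way, as exp(Estimate ± z × Standard Error) with z the 0.975 standard normal quantile; the data set and variable names are illustrative and not part of the lecture.

```sas
/* Recompute odds ratios and 95% Wald limits from the printed estimates. */
data odds_ratio_check;                  /* hypothetical data set name */
    length effect $8;
    input effect $ estimate std_err;
    z          = quantile('NORMAL', 0.975);     /* about 1.96 */
    odds_ratio = exp(estimate);                 /* point estimate */
    lower      = exp(estimate - z * std_err);   /* lower Wald limit */
    upper      = exp(estimate + z * std_err);   /* upper Wald limit */
    datalines;
female  1.4825 0.4474
read    0.1035 0.0258
science 0.0948 0.0305
;
run;

proc print data=odds_ratio_check noobs;
    var effect odds_ratio lower upper;
    format odds_ratio lower upper 8.3;
run;
```

Since the estimates and standard errors are entered as printed (four decimals), the last digit may differ slightly from the PROC LOGISTIC table.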
We can interpret the odds ratio as follows: for a one-unit increase in the predictor variable, the odds of a positive outcome are multiplied by the odds ratio, exp[Estimate], given that the other variables in the model are held constant.

If the 95% CI does not cover 1, it suggests the estimate is statistically significant at the 0.05 level.

Weighted Example
• Just as with linear regression, logistic regression allows you to look at the effect of multiple predictors on an outcome.
• Consider the following example: 15- and 16-year-old adolescents were asked if they have ever had sexual intercourse.
  – The outcome of interest is intercourse.
  – The predictors are race (white and black) and gender (male and female).

Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.

Here is a table of the data:

                       Intercourse
    Race    Gender     Yes     No
    White   Male        43    134
            Female      26    149
    Black   Male        29     23
            Female      22     36

Raul Cruz-Cano, HLTH653 Spring 2013

Data Set Intercourse

    DATA intercourse;
        INPUT white male intercourse count;
        DATALINES;
    1 1 1 43
    1 1 0 134
    1 0 1 26
    1 0 0 149
    0 1 1 29
    0 1 0 23
    0 0 1 22
    0 0 0 36
    ;
    RUN;

SAS:

    PROC LOGISTIC DATA = intercourse descending;
        weight count;
        MODEL intercourse = white male / rsquare lackfit;
    RUN;

• “descending” models the probability that intercourse = 1 (yes) rather than = 0 (no).
• “rsquare” requests a generalized R2 value from SAS; it is interpreted in much the same way as the R2 from linear regression.
• “lackfit” requests the Hosmer and Lemeshow Goodness-of-Fit Test. This tells you if the model you have created is a good fit for the data.

SAS Output: R2

Interpreting the R2 value
The R2 value is 0.9907. This means that 99.07% of the variability in our outcome (intercourse) is explained by including gender and race in our model.
PROC LOGISTIC Output
The odds of having intercourse are 1.911 times as high for males as for females.

Hosmer and Lemeshow GOF Test

H-L GOF Test
The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses:
    Ho: the model is a good fit, vs.
    Ha: the model is NOT a good fit
With this test, we want to FAIL to reject the null hypothesis, because that means our model is a good fit (this is different from most of the hypothesis testing you have seen).
Look for a p-value > 0.10 in the H-L GOF test. This indicates the model is a good fit.
In this case, the p-value = 0.2419, so we do NOT reject the null hypothesis, and we conclude the model is a good fit.

Model Selection in SAS
• Can be applied to both Linear and Logistic Models.
• Often, if you have multiple predictors and interactions in your model, SAS can systematically select significant predictors using forward selection, backward selection, or stepwise selection.
• In forward selection, SAS starts with no predictors in the model. It then selects the predictor with the smallest p-value and adds it to the model. Next, it selects the remaining predictor with the smallest p-value and adds it to the model. It continues doing this until no remaining predictor has a p-value less than 0.05.
• In backward selection, SAS starts with all of the predictors in the model and eliminates the non-significant predictors one at a time, refitting the model after each elimination. It stops once all the predictors remaining in the model are statistically significant.

Forward Selection in SAS
We will let SAS select a model for us out of the three predictors: white, male, white*male.
Type the following code into SAS:

    PROC LOGISTIC DATA = intercourse descending;
        weight count;
        MODEL intercourse = white male white*male / selection = forward lackfit;
    RUN;

Output from Forward Selection:
• “white” is added to the model
• “male” is added to the model
• No more predictors are found to be statistically significant

The Final Model:
Hosmer and Lemeshow GOF Test: The model is a good fit

SAS Weighted vs. Survey Procedures
• A random sample of 300 students from each of the classes: freshman, sophomore, junior, and senior.

    /* Labels for the coded survey variables */
    proc format;
        value Design 1='A' 2='B' 3='C';
        value Rating 1='dislike very much' 2='dislike' 3='neutral'
                     4='like' 5='like very much';
        value Class  1='Freshman' 2='Sophomore' 3='Junior' 4='Senior';
    run;

    /* Population size (enrollment) of each class */
    data Enrollment;
        format Class Class.;
        input Class _TOTAL_;
        datalines;
    1 3734
    2 3565
    3 3903
    4 4196
    ;
    run;

    /* One count for each Class x Design x Rating combination */
    data WebSurvey;
        format Class Class. Design Design. Rating Rating.;
        do Class=1 to 4;
            do Design=1 to 3;
                do Rating=1 to 5;
                    input Count @@;
                    output;
                end;
            end;
        end;
        datalines;
    10 34 35 16 15   8 21 23 26 22   5 10 24 30 21
     1 14 25 23 37  11 14 20 34 21  16 19 30 23 12
    19 12 26 18 25  11 14 24 33 18  10 18 32 23 17
     8 15 35 30 12  15 22 34  9 20   2 34 30 18 16
    ;
    run;

    /* Sampling weight = class enrollment / 300 sampled students */
    data WebSurvey;
        set WebSurvey;
        if Class=1 then Weight=3734/300;
        if Class=2 then Weight=3565/300;
        if Class=3 then Weight=3903/300;
        if Class=4 then Weight=4196/300;
    run;

PROC Logistic

    proc logistic data=WebSurvey;
        freq Count;
        class Design;
        model Rating (ref='neutral') = Design;
        weight Weight;
    run;

PROC surveylogistic
If you want “better” results:

    proc surveylogistic data=WebSurvey total=Enrollment;
        freq Count;
        class Design;
        model Rating (ref='neutral') = Design;
        stratum Class;
        weight Weight;
    run;

For the Ratings for Design B vs. Design C, compare:
1. The point estimate
2. The 95% Confidence Interval
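The four IF statements that build Weight hard-code the enrollment totals. As a minimal sketch, the same weights could instead be derived from the Enrollment data set by merging on Class; the output data set name WebSurveyWeighted is illustrative and not part of the lecture.

```sas
/* Sketch: derive the sampling weights from the Enrollment totals. */
proc sort data=WebSurvey;  by Class; run;
proc sort data=Enrollment; by Class; run;

data WebSurveyWeighted;                 /* hypothetical data set name */
    merge WebSurvey Enrollment;         /* adds _TOTAL_ to every survey row */
    by Class;
    Weight = _TOTAL_ / 300;             /* 300 students sampled per class */
    drop _TOTAL_;
run;
```

Under these assumptions the resulting Weight values match the hard-coded ones (for example 3734/300 for freshmen), and a change in the enrollment figures only needs to be made in one place.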