Screening the Data
Tedious but essential!

Missing Data
• Missing Not at Random (MNAR)
• Missing at Random (MAR)
• Missing Completely at Random (MCAR)

Missing Not at Random (MNAR)
• Some cases are missing data on Y.
• Missingness is related to the value of Y.
• Faculty salaries: those with high salaries may be reluctant to reveal them.
• Estimates of mean Y will be biased if we use only the available data.

Missing at Random (MAR)
• Missingness on Y is not related to the value of Y.
• Or it is related, but only through other variables on which we have data.
• Faculty salary is related to rank.
• Higher rank = higher salary
• If missingness is random within each rank, within-rank estimates will be unbiased.
• Overall mean = weighted sum of within-rank estimates

Missing Completely at Random (MCAR)
• There is no variable, observed or not, that is related to missingness of Y.
• Ideal, not likely ever absolutely true.

Finding Patterns of Missingness
• There is specialized software. You do not have it.
• Can use SAS.
• Can use SPSS with home license code.
• Create a missingness dummy variable
• 0 = not missing, 1 = missing
• Relate missingness to other variables.

Dealing with MCAR Data
• Delete Cases: Will create no bias, but will lower power and precision.
• Mean Substitution: For each missing value, substitute the group mean on that variable. No bias for means, but will reduce standard deviations.
• Regression: For each missing score, develop a multiple regression to predict the score from other variables. Impute that predicted score. Regression toward the mean will reduce variability.

Dealing with MAR Data
• Deletion of Variables: If another variable can serve as a proxy.
• Multiple Imputation – specialized software, may eliminate bias
  – Involves resampling techniques to generate several sets of predictions of missing scores
  – Analyze each set and then average the results across sets.

Dealing with MNAR Data
• Sophisticated methods may reduce, but not eliminate, bias.
• Pairwise Correlation Matrix: Use as input to multivariate procedures. Different correlations will be based on different subsets of the data. Can produce very strange results; not recommended.

Missing Item Data Within Unidimensional Scale
• Assume each item measures the same construct.
• For each subject, compute the mean of the items which do have data.
• Set to missing the scale scores for subjects who have answered fewer than a threshold number of items.

Identifying Outliers
• Univariate: Box-and-whisker plots
• Multivariate: Compute Mahalanobis distance or leverage. Investigate cases with high values. Use an outlier dummy variable to compare outliers with inliers.
• Regression Diagnostics:
  o Leverage: Cases with unusual values on the predictor variables.
  o Standardized Residuals: Cases whose actual Y is far from predicted Y.
  o Cook’s D: Cases with values that give them great influence on the regression solution.

Dealing with Outliers
• Investigate: May be bad data. May be able to correct the data, may not. May represent cases not properly considered part of the population of interest.
• Out-of-Range Values: Even if not outliers, these are bad data that need correction.
• Set to Missing: If all else fails.
• Delete the Case: For example, if convinced the respondent was not even reading the questions.
  – “I frequently visit planets outside of our solar system.”
  – “I make all of my own clothes.”
• Delete the Variable: Last resort when it has many cases with missing data.
• Transform the Variable: If outliers are valid but contributing to skewness.
• Change the Score: For example, reduce a very high score to a value slightly higher than the next-highest score. See Howell’s discussion of “Winsorizing.”

Assumptions of the Analysis
• Check Outliers First: Dealing with outliers may resolve the problems below.
• Normality: Look at plots and measures of skewness and kurtosis.
Ignore tests of significance, like Kolmogorov-Smirnov. May need to use a different analysis.
• Homogeneity of Variance: Does the variance differ considerably across groups? May need to transform or use a different analysis.
• Homoscedasticity: Carefully inspect the residuals. May need to transform data or use a different analysis.
• Homogeneity of Variance/Covariance Matrices (across groups): Box’s M.
• Sphericity: For univariate-approach related-samples ANOVA. Check with Mauchly’s test. Correct the df or use a multivariate approach instead.
• Homogeneity of Regression: In ANCOVA, we assume the relationship between Y and the predictors is constant across groups. Test the Groups x Predictor(s) interactions.
• Linear Relationships: Look at plots. If necessary, transform variables or use curvilinear techniques.

Multicollinearity
• One predictor is nearly perfectly correlated with the other predictors.
• Makes the regression coefficients unstable across random samples from the same population.
• Complicates the interpretation of unique effects.

Detecting Multicollinearity
• For each predictor, compute the R² between it and the other predictors. If very high (.9 or more), there is a problem.
• SAS will compute tolerance = (1 – that R²). If very low, there is a problem.
• If R² = 1, the correlation matrix is singular and cannot be inverted, so the analysis crashes.
  – Example: Predictors = Verbal SAT, Math SAT, Total SAT.

Variance Inflation Factor
• VIF = 1/tolerance. If high, there is a problem.
• How high?
• Some say 10, some say 5, a few say 2.5.
• If R² = .9, tolerance = .1, VIF = 10.

Dealing with Multicollinearity
• Drop a Predictor – may resolve the problem.
• Combine Predictors – into a composite variable
• Principal Components Analysis – conduct the analysis on the resulting weighted linear combinations of the variables. Can then transform the results back to the original variables.
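
The tolerance and VIF arithmetic above can be sketched in a few lines. This is only an illustration, written in Python rather than the SAS used in these slides. The function names and the default cutoff of 10 are assumptions, not part of any SAS procedure. It takes the R² of one predictor regressed on the others as given and does not fit the regression itself.

```python
import math


def tolerance_and_vif(r_squared):
    """Return (tolerance, VIF) for one predictor.

    r_squared is the R-squared from regressing that predictor
    on all of the other predictors.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R-squared must lie between 0 and 1")
    tolerance = 1.0 - r_squared
    # R-squared = 1 means the predictor is completely redundant
    # (singular correlation matrix), so VIF is unbounded.
    vif = math.inf if tolerance == 0.0 else 1.0 / tolerance
    return tolerance, vif


def is_problem(r_squared, vif_cutoff=10.0):
    """Hypothetical helper: flag a predictor whose VIF reaches the cutoff."""
    _, vif = tolerance_and_vif(r_squared)
    return vif >= vif_cutoff


# The three cutoffs people quote (2.5, 5, 10) correspond to
# R-squared values of .6, .8 and .9; R-squared = 1 is the singular case.
for r2 in (0.6, 0.8, 0.9, 1.0):
    tol, vif = tolerance_and_vif(r2)
    print(f"R2 = {r2:.1f}  tolerance = {tol:.2f}  VIF = {vif:.1f}")
```

Passing a stricter cutoff, for example `is_problem(0.95, vif_cutoff=5.0)`, applies the more conservative rule of thumb.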

SAS 1
• Look at the command lines in the SAS program.
• Always give every case a unique ID number, so you can locate it later.
• Label variables if their SAS name is not informative.
    input ID 1-3 @5 (Q1-Q138) (1.);
    label Q1 = 'Sex' Q3 = 'Age';

SAS 2
• Recode values that represent missing data.
• On several variables, such as “number of biological brothers,” response 5 was “do not know.”
    if Q15 = 5 then Q15 = . ;
    if Q16 = 5 then Q16 = . ;

SAS 3 & 4
• Transform variable to reduce positive skewness
    age_sr = sqrt(Q3);
    age_log = log10(Q3);
    age_inv = -1/(Q3);
• Dichotomize variable – transformation of last resort.
    if Q3 = 1 then age_di = 1;
    else if Q3 > 1 then age_di = 2;

SAS 5 & 6
• Create composite variable
    SIBS = Q15 + Q16;
• Transform to reduce positive skewness
    sibs_sr = sqrt(sibs);
    sibs_log = log10(sibs);
    sibs_in = -1/sibs;

SAS 7
• Create the Mental variable and its associated missingness variable.
    MENTAL = Q62 + Q65 + Q67;
    MentalMiss = 0;
    if Mental = . then MentalMiss = 1;

SAS 8
• Transform to reduce negative skewness
    Mental2 = Mental*Mental;
    Mental3 = Mental**3;
    Ment_exp = EXP(Mental);
    R_Ment = 13 - Mental;
    R_Ment_sr = sqrt(R_Ment);
    R_Ment_log = log10(R_Ment);

SAS 9
• Dichotomize Mental
    if 0 LE Mental LE 9 then Ment_di = 1;
    else if Mental > 9 then Ment_di = 2;
• Be careful: SAS treats a missing value as lower than any number in comparisons, which is why the first condition has a lower bound of 0.

SAS 10
• Check for missing data and out-of-range values.
    proc means min max n nmiss;
    var q1-q10 q50-q70;
    run;

SAS 11
• Check for skewness & kurtosis
    proc means min max n nmiss skewness kurtosis;
    var Q3 age_sr -- Mental Mental2 -- R_Ment_log;
    run;

SAS 12
• Check distributions of variables with few values
    proc freq;
    tables q3 age_di sibs mental ment_di;
    run;

SAS 13
• Locate cases with bad data
    data duh; set delmel; if q9 > 3;
    proc print; var q9; id id;
    run;
• Case 159 has an out-of-range value on item Q9.

SAS 14
• Check correlates of missingness.
    proc corr nosimple data=delmel;
    var MentalMiss;
    with Q1 Q3 Q5 Q6 sibs;
    run;
• MentalMiss is negatively correlated with sibs.
• Duh, some subjects have missing data on number of brothers or number of sisters.
• Instead of Mental = Q62+Q65+Q67, use Mental = Mean(of Q62 Q65 Q67);

SAS 15
• Identify multivariate outliers
    proc reg data=delmel;
    model id = Q1 Q3 Q6 mental;
    output out=hat H=Leverage;
    run;
    data outliers; set hat; if leverage > .052;
    proc print; var id Q1 Q3 Q6 mental leverage; run;
    proc means mean; var Q1 Q3 Q6 mental; run;
• Leverage depends only on the predictors, so the choice of outcome variable (here, ID) does not matter.
• As a group, the outliers are older than the overall sample.
• All three students aged 25 or older are included among the outliers.

Survey Scoundrels
• These sloths do not even read the questions; they just answer randomly to get whatever incentive is available for completing the survey.
• My daughter’s shock upon discovering this.
• Monitor how long it takes respondents to complete the survey.

Items to Help Detect Scoundrels
• Repeat the same item and compare responses.
• “I frequently visit with aliens from other planets.”
• “I make all of my own clothes.”
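
The three checks just listed (completion time, a repeated item, and bogus items) could be combined into a simple screening pass. The sketch below is a Python illustration only. The field names, the 120-second minimum, the allowed gap of 1 between repeated answers, and the bogus-item cutoff of 4 on a 1 to 5 agreement scale are all hypothetical and would need to be set for the actual survey.

```python
def flag_scoundrels(respondents, min_seconds=120, max_repeat_gap=1,
                    bogus_cutoff=4):
    """Return {respondent id: list of reasons} for suspicious respondents.

    Each respondent is a dict with hypothetical keys:
      id          - unique case ID
      seconds     - time taken to complete the survey
      item        - answer to an item that appears twice
      item_repeat - answer to the repeat of that item
      bogus       - answers to bogus items (e.g., "I make all of
                    my own clothes"), higher = stronger agreement
    """
    flagged = {}
    for person in respondents:
        reasons = []
        # Finished implausibly fast.
        if person["seconds"] < min_seconds:
            reasons.append("too fast")
        # Answers to the repeated item disagree by more than allowed.
        if abs(person["item"] - person["item_repeat"]) > max_repeat_gap:
            reasons.append("inconsistent repeat")
        # Agreed with at least one bogus item.
        if any(score >= bogus_cutoff for score in person["bogus"]):
            reasons.append("endorsed bogus item")
        if reasons:
            flagged[person["id"]] = reasons
    return flagged


# Made-up cases for illustration only.
sample = [
    {"id": 1, "seconds": 600, "item": 4, "item_repeat": 4, "bogus": [1, 1]},
    {"id": 2, "seconds": 45, "item": 2, "item_repeat": 5, "bogus": [5, 4]},
    {"id": 3, "seconds": 300, "item": 1, "item_repeat": 5, "bogus": [1, 2]},
]
flagged = flag_scoundrels(sample)
for case_id, reasons in flagged.items():
    print(case_id, ", ".join(reasons))
```

Flagged cases are candidates for a closer look, not automatic deletions. A respondent who trips only one check may simply be a fast or careless reader.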