Data Screaming! Validating and Preparing Your Data
Lyytinen & Gaskin

Data Screening
• Data screening (also known to us as "data screaming") ensures your data is "clean" and ready to go before you conduct your planned statistical analyses.
• Data must always be screened to ensure it is reliable and valid for testing the type of causal theory you have planned.
• Screening and cooking are not synonymous – screening is like preparing the best ingredients for your gourmet food!

Necessary Data Screening To Do:
• Handle missing data
• Address outliers and influentials
• Meet the multivariate statistical assumptions for the alternative tests (scales, n, normality, covariance)

Statistical Problems with Missing Data
• If you are missing much of your data, this can cause several problems, e.g., the estimated model cannot be calculated.
• EFA, CFA, and path models require a certain minimum number of data points in order to compute estimates – each missing data point reduces your valid n by 1.
• Greater model complexity (number of items, number of paths) and improved power require larger samples.

Logical Problem with Missing Data
• Missing data can indicate systematic bias, because respondents may not have answered particular questions in your survey due to a common cause (poor formulation, sensitivity, etc.).
• For example, if you ask about gender, and females are less likely to report their gender than males, then you will have "male-biased" data. Perhaps only 50% of the females reported their gender, but 95% of the males did.
• If you then use gender as a moderator in your causal models, you will be heavily biased toward males, because you cannot use the unreported responses from females. You may also have a biased sample of female respondents.

Detecting Missing Values
(Screenshot: SPSS steps for detecting missing values.)

Handling Missing Data
• Missing less than 10% of values from a variable or respondent is typically not problematic (unless you lose specific items, or one end of the tail).
• Methods for handling missing data:
– >10%: just don't use that variable/respondent (unless dropping it takes you below an acceptable n)
– <10%: impute if not categorical
– Warning: if you remove too many respondents, you will introduce response bias
• If the DV is missing, there is little you can do with that record.
• One alternative is to impute and run the models with and without the missing data to see how sensitive the results are.

Imputation Methods (Hair, Table 2-2)
• Use only valid data
– No imputation; just use valid cases or variables
– In SPSS: Exclude Pairwise (variable), Listwise (case)
• Use known replacement values
– Match the missing value with a similar case's value
• Use calculated replacement values
– Use the variable mean, median, or mode
– Regression based on known relationships
• Model-based methods
– Iterative two-step estimation of values and descriptives to find the most appropriate replacement value

Mean Imputation in SPSS
(Screenshot: SPSS Replace Missing Values dialog.)
• Include each variable that has values that need imputing.
• For each variable you can choose a new name (for the imputed column) and the type of imputation.

Best Method – Prevention!
• Short surveys (pre-testing is critical!)
• Easy-to-understand and easy-to-answer survey items (pre-testing is critical)
• Force completion (incentives, technology)
• Bribe/motivate (iPad drawing)
• Digital surveys (rather than paper)
• Put dependent variables at the beginning of the survey!

Order for Handling Missing Data
1. First decide which variables are going to be used in the model.
2. Then handle missing data based on that set of variables.
3. Then decide the method to handle missing data (see Hair, Chapter 2).
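A rough pandas sketch of this workflow is shown below: check per-variable and per-respondent missingness, set aside anything missing more than 10%, and mean-impute the rest. The file name and column names (survey.csv, rolconf1, etc.) are hypothetical placeholders; SPSS's Replace Missing Values dialog does the same job.

```python
# Minimal sketch of the missing-data workflow above (pandas).
# File and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical survey export
model_vars = ["rolconf1", "rolconf2", "particip1", "particip2"]  # step 1: variables used in the model
df = df[model_vars]

# Step 2: inspect missingness for that set of variables
col_missing = df.isna().mean()        # share of missing values per variable
row_missing = df.isna().mean(axis=1)  # share of missing values per respondent

# Set aside variables/respondents missing more than 10%
df = df.loc[row_missing <= 0.10, col_missing[col_missing <= 0.10].index]

# Step 3: mean-impute the remaining (non-categorical) variables
df = df.fillna(df.mean(numeric_only=True))
```

Running the model with and without the imputed values, as suggested above, is a simple sensitivity check.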
Outliers and Influentials
• Outliers can influence your results, pulling the mean away from the median.
• Outliers also affect distributional assumptions and often reflect false or mistaken responses.
• Two types of outliers:
– Outliers on individual variables (univariate): extreme values for a single variable
– Outliers for the model (multivariate): extreme (uncommon) values for a correlation

Detecting Univariate Outliers
(Boxplot: values far from the mean are flagged as outliers; 50% of cases should fall within the box and 99% within the whisker range.)

Handling Univariate Outliers
• Univariate outliers should be examined on a case-by-case basis.
• If the outlier is truly abnormal, and not representative of your population, then it is okay to remove it. But this requires careful examination of the data points.
– e.g., you are studying dogs, but somehow a cat got ahold of your survey
– e.g., someone answered "1" for all 75 questions on the survey
• However, just because a data point doesn't fit comfortably within the distribution does not by itself nominate that data point for removal.

Detecting Multivariate Outliers
• Multivariate outliers are data points (tuples) that do not fit the pattern of correlations exhibited by the other data points in the dataset with regard to your causal model.
• For example, if all but one person in the dataset report that dieting has a positive effect on weight loss, but this one person reports that he gains weight when he diets, then his record would be considered an outlier.
• To detect these influential multivariate outliers, calculate the Mahalanobis d-squared. (Easy in AMOS.)
(AMOS output: the observation numbers are row numbers from SPSS; anything less than .05 in the p1 column is abnormal and is a candidate for inspection.)

Handling Multivariate Outliers
• Create a new variable in SPSS called "Outlier":
– Code 0 if the Mahalanobis p1 value is > .05
– Code 1 if the Mahalanobis p1 value is < .05
• I have a tool for this if you want…
• Then in AMOS, when selecting data files, use "Outlier" as a grouping variable, with the grouping value set to 0.
– This runs your model with only the non-outliers.

Before and After Removing Outliers
(Model results before removing outliers (N=340) and after (N=295).)
• Even after you remove outliers, the Mahalanobis test will produce a whole new set of outliers, so these should be checked on a case-by-case basis, using the Mahalanobis values as a guide for inspection.

"Best Practice" for Outliers
• In general, it is a bad idea to remove outliers unless they are truly "abnormal" and do not represent accurate observations from the population. The logic of removal needs to be based on the semantics of the data.
• Removing outliers (especially en masse, as demonstrated with the Mahalanobis values) is risky because it decreases your ability to generalize: you do not know the cause of this type of variance, and it may be more than just noise.
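If you want to compute the d-squared values outside AMOS (for inspection, not wholesale removal), a minimal Python sketch is shown below. It assumes the model variables sit in a pandas DataFrame with complete cases; the p-values come from the chi-square distribution with degrees of freedom equal to the number of variables, analogous to the p1 column described above. The column names in the usage comment are hypothetical.

```python
# Minimal sketch of Mahalanobis d-squared screening (assumes complete cases).
import numpy as np
import pandas as pd
from scipy.stats import chi2

def flag_multivariate_outliers(data: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Return d-squared, its chi-square p-value, and an Outlier flag (1 = outlier) per row."""
    X = data.to_numpy(dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # Mahalanobis d-squared per row
    p = chi2.sf(d2, df=X.shape[1])                       # analogous to the p1 column
    return pd.DataFrame(
        {"d2": d2, "p1": p, "Outlier": (p < alpha).astype(int)},
        index=data.index,
    )

# Usage with hypothetical column names:
# flags = flag_multivariate_outliers(df[["loyalty1", "loyalty2", "loyalty3"]])
# clean = df[flags["Outlier"] == 0]  # grouping value 0 = non-outliers, as in AMOS above
```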
Statistical Assumptions
Part of data screening is ensuring you meet the four main statistical assumptions for multivariate data analysis:
1. Normality
2. Homoscedasticity
3. Linearity
4. Multicollinearity
These assumptions are intended to hold for scalar and continuous variables, rather than categorical ones (we prefer gender to be bimodal).

Normality
• Normality refers to the distributional assumptions of a variable.
• We usually assume in covariance-based models that the data is normally distributed, even though many times it is not!
• Other tests, like PLS or binomial regressions, do not require such assumptions.
• t-tests and F-tests assume normal distributions.
• Normality is assessed in many ways: shape, skewness, and kurtosis (flat/peaked).
• Normality issues affect small sample sizes (<50) much more than large sample sizes (>200).
(Figure: example distributions illustrating shape (bimodal, flat), skewness, and kurtosis.)

Tests for Skewness and Kurtosis
(SPSS Descriptives output showing the skewness and kurtosis statistics and their standard errors.)
• Relaxed rule:
– Skewness > 1 = positively (right) skewed
– Skewness < -1 = negatively (left) skewed
– Skewness between -1 and 1 is fine
• Strict rule:
– Abs(Skewness) > 3 × Std. Error = skewed
– Same for Kurtosis

Tests for Normality
• In SPSS: Analyze → Explore → Plots → Normality.
(Output: KS and SW tests with plots for role conflict and participation.)
*Neither of these variables would be considered normally distributed according to the KS or SW tests, but a visual inspection shows that role conflict (left) is roughly normal and participation (right) is positively skewed. So, ALWAYS conduct visual inspections!

Fixing Normality Issues
• Fix a flat distribution with:
– Inverse: 1/X
• Fix a negatively skewed distribution with:
– Squared: X*X
– Cubed: X*X*X
• Fix a positively skewed distribution with:
– Square root: SQRT(X)
– Logarithm: LG10(X)

Before and After Transformation
(Histograms: a negatively skewed variable before and after the cubed transformation.)

Homoscedasticity
• Homoscedasticity is a nasty word that helps impress your listeners!
• If a relationship has this property, it means that the DV exhibits consistent variance across the different levels of the IV.
• A simple way to determine whether a relationship is homoscedastic is to do a scatter plot with the IV on the x-axis and the DV on the y-axis.
• If the plot shows a linear pattern and has a substantial R-square, we have homoscedasticity!
• If there is no linear pattern and the R-square is low, then the relationship is heteroscedastic.

Scatterplot Approach
(Scatterplots illustrating homoscedastic and heteroscedastic relationships.)

Linearity
• Linearity refers to the consistent slope of change that represents the relationship between an IV and a DV.
• If the relationship between the IV and the DV is radically inconsistent, it will throw off your SEM analyses, because your data is not linear.
• Sometimes you can achieve linearity with transformations (log-linear).
(Figure: examples of a good (linear) and a bad (non-linear) relationship.)

Multicollinearity
• Multicollinearity is not desirable in regressions (but it is desirable in factor analysis!).
• It means that independent variables are too highly correlated with each other and share too much variance.
• It influences the accuracy of estimates for the DV and inflates the error terms for the DV (Hair).
• How much unique variance does the black circle actually account for?
(Venn diagram of overlapping IV variances.)

Detecting Multicollinearity
• An easy way to check this is to calculate a Variance Inflation Factor (VIF) for each independent variable: run a regression using one of the IVs as the dependent variable, regressing it on all the remaining IVs, and then swap out the IVs one at a time (see the sketch at the end of this section).
• Rules of thumb for the VIF:
– VIF < 3: no problem
– VIF > 3: potential problem
– VIF > 5: very likely a problem
– VIF > 10: definitely a problem

Handling Multicollinearity
(Regression output: Loyalty 2 and Loyalty 3 seem to be too similar in both of these tests; dropping Loyalty 2 fixed the problem.)
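To make the VIF check concrete, here is a minimal sketch using statsmodels. The file and column names (survey.csv, loyalty1–loyalty3, satisfaction) are hypothetical stand-ins for your IVs; variance_inflation_factor regresses each IV on the remaining IVs, as described above.

```python
# Minimal sketch of a VIF check (statsmodels); file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("survey.csv")                                  # hypothetical survey export
ivs = df[["loyalty1", "loyalty2", "loyalty3", "satisfaction"]]  # hypothetical IVs
X = sm.add_constant(ivs)                                        # constant term for meaningful VIFs

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=ivs.columns,
)
print(vifs)  # VIF > 5 = very likely a problem; VIF > 10 = definitely a problem

# If loyalty2 and loyalty3 turn out to be nearly redundant, dropping one of them
# (as with Loyalty 2 above) and re-running the check should bring the VIFs down.
```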