Statistics – Spring 2008 Lab #1 – Data Screening

The purpose of data screening is to:
(a) check whether data have been entered correctly, such as out-of-range values.
(b) check for missing values, and decide how to deal with them.
(c) check for outliers, and decide how to deal with them.
(d) check for normality, and decide how to deal with non-normality.

1. Finding incorrectly entered data

Your first step in data screening is to use "Frequencies":
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click OK.

The output below is for only the four "system" variables in our dataset, because copying the output for all variables in our dataset would take up too much space in this document. The "Statistics" box tells you the number of missing values for each variable. We will use this information later when we discuss missing values.

Each variable is then presented as a frequency table. For example, below we see the output for "system1". By looking at the coding manual for the "Legal Beliefs" survey, you can see that the available responses for "system1" are 1 through 11. By looking at the output, you can see that there is an out-of-range number: "13". (NOTE – in your dataset there will not be a "13" because I gave you the screened dataset; I have included the "13" in this example to show you what it looks like when a number is out of range.)

Since 13 is an invalid number, you then need to identify why "13" was entered. For example, did the person entering data make a mistake? Or did the subject respond with "13" even though the question indicated that only numbers 1 through 11 are valid? You can identify the source of the error by looking at the hard copies of the data. First, identify which subject indicated the "13" by clicking on the variable name (system1) to highlight it and then using the "find" function (Edit --> Find), and then scrolling to the left to identify the subject number. Then, hunt down the hard copy of the data for that subject number.
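If you also keep a copy of the raw data outside SPSS, the same screening idea can be sketched with pandas. This is only a rough illustration, not part of the lab: the file name "legal_beliefs.csv" is hypothetical, while the system1–system4 variables and the 1–11 coding come from our dataset.

    # Sketch: frequency counts, out-of-range codes, and missing counts per variable
    import pandas as pd

    df = pd.read_csv("legal_beliefs.csv")   # hypothetical file name

    for col in ["system1", "system2", "system3", "system4"]:
        counts = df[col].value_counts(dropna=False).sort_index()
        print(col, "frequencies:\n", counts)                 # analogous to the frequency table
        out_of_range = df.loc[(df[col] < 1) | (df[col] > 11), col]
        print(col, "out-of-range values:", out_of_range.tolist())
        print(col, "missing:", df[col].isna().sum())         # analogous to the "Statistics" box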
2. Missing Values

Below, I describe in depth how to identify and deal with missing values.

Why do missing values occur? Missing values are either random or non-random. Random missing values may occur because the subject inadvertently did not answer some questions. For example, the study may be overly complex and/or long, or the subject may be tired and/or not paying attention, and miss the question. Random missing values may also occur through data entry mistakes. Non-random missing values may occur because the subject purposefully did not answer some questions. For example, the question may be confusing, so many subjects do not answer it. The question may also fail to provide appropriate answer choices, such as "no opinion" or "not applicable", so the subject chooses not to answer it. Subjects may also be reluctant to answer some questions because of social desirability concerns about the content of the question, such as questions about sensitive topics like past crimes, sexual history, or prejudice or bias toward certain groups.

Why is missing data a problem? Missing values mean reduced sample size and loss of data. You conduct research to measure empirical reality, so missing values thwart the purpose of research. Missing values may also indicate bias in the data. If the missing values are non-random, then the study is not accurately measuring the intended constructs. The results of your study may have been different if the missing data were not missing.

How do I identify missing values?
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click OK.

The output below is for only the four "system" variables in our dataset, because copying the output for all variables would take up too much space in this document. The "Statistics" box tells you the number of missing values for each variable.

How do I identify whether missing values are random or non-random?

First, if there are only a small number of missing values, then they are extremely unlikely to be non-random. For example, "system3" has only 2 missing values, so 2 people out of 327 total subjects could hardly be "non-random".

Second, even if there is a larger number of missing values, that does not necessarily mean the missing values are non-random. You should look to the question itself to identify whether it is poorly constructed or engenders social desirability concerns.

Third, some questions will always have a large number of missing values because of the way the question is designed. For example, the "threshold3" questions in our dataset ask the subjects to "mark all answers that apply", so there will be a lot of missing data because some options are chosen less frequently than others.

Fourth, SPSS has an add-on module called "Missing Values Analysis" that will statistically test whether missing values are random or non-random. The add-on is included in your copy of SPSS, but most people do not have the add-on module; it is not even offered on the versions of SPSS at USC. Given how rarely non-random missing values occur in datasets, I know no one who conducts this analysis. However, if you do want to conduct missing values analysis using the SPSS add-on, you can access it by Analyze --> Missing Value Analysis, and check EM estimation. EM estimation checks whether the subjects with missing values are different from the subjects without missing values. If p < .05, then the two groups are significantly different from each other, which indicates the missing values are non-random. In other words, you want the value to be greater than .05, which indicates the missing values are random.

How do I deal with missing values? Irrespective of whether the missing values are random or non-random, you have three options.

Option 1 is to do nothing. Leave the data as is, with the missing values in place. This is the most frequent approach, for a few reasons. First, the number of missing values is typically small. Second, missing values are typically random. Third, even if there are a few missing values on individual items, you typically create composites of the items by averaging them together into one new variable, and this composite variable will not have missing values because it is an average of the existing data, as illustrated in the sketch below.
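The following small illustration of the composite point is not from the lab; the item scores are made up. It shows why an item-average composite usually has no missing values: pandas' mean(axis=1) averages whatever items are present, and only returns a missing value if every item is missing.

    import numpy as np
    import pandas as pd

    items = pd.DataFrame({
        "system1": [3, 5, np.nan, 4],
        "system2": [4, np.nan, np.nan, 4],
        "system3": [5, 6, np.nan, 2],
    })
    # skipna=True is the default, so each subject's composite uses whatever items exist
    items["system_composite"] = items.mean(axis=1)
    print(items)
    # Only the third subject (all three items missing) has a missing composite.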
However, if you choose this option, you must keep in mind how SPSS will treat the missing values. SPSS will use either "listwise deletion" or "pairwise deletion" of the missing values. You can elect either one when conducting each test in SPSS.

a. Listwise deletion – SPSS will not include cases (subjects) that have missing values on the variable(s) under analysis. If you are only analyzing one variable, then listwise deletion is simply analyzing the existing data. If you are analyzing multiple variables, then listwise deletion removes a case (subject) if there is a missing value on any of the variables. The disadvantage is a loss of data, because you are removing all data from subjects who may have answered some of the questions but not others (i.e., the missing data).

b. Pairwise deletion – SPSS will include all available data. Unlike listwise deletion, which removes cases (subjects) that have missing values on any of the variables under analysis, pairwise deletion only removes the specific missing values from the analysis (not the entire case). In other words, all available data are included. For example, if you are conducting a correlation on multiple variables, then SPSS will compute each bivariate correlation from all available data points, ignoring only the missing values where they occur. In this case, pairwise deletion will result in different sample sizes for each correlation. Pairwise deletion is useful when the sample size is small or missing values are numerous, because there are not many values to begin with, so why omit even more with listwise deletion?

c. In order to better understand how listwise deletion versus pairwise deletion influences your results, try conducting the same test using both deletion methods (see the sketch at the end of this section). Does the outcome change?

d. IMPORTANT – for each type of test you conduct, you need to identify whether SPSS is using listwise or pairwise deletion. I would recommend electing pairwise deletion, if possible. For example, we have been using the Explore command. If you are analyzing more than one variable in the Explore command, be sure to click "Options" and "Exclude cases pairwise", because the default option is listwise deletion. Most tests allow you to elect your preference, but GLM Multivariate only allows listwise. So, always check your output for the number of cases used in each analysis.

Option 2 is to delete cases with missing values. For every missing value in the dataset, you can delete the subject with the missing value. Thus, you are left with complete data for all subjects. The disadvantage of this approach is that you reduce the sample size of your data. If you have a large dataset, this may not be a big disadvantage because you have enough subjects even after you delete the cases with missing values. Another disadvantage is that the subjects with missing values may be different from the subjects without missing values (i.e., the missing values are non-random), so you have a nonrepresentative sample after removing the cases with missing values. One situation in which I use Option 2 is when particular subjects have not answered an entire scale or page of the study.

Option 3 is to replace the missing values, called imputation. There is little agreement about whether or not to conduct imputation. There is some agreement, however, about which type of imputation to conduct. For example, you typically do NOT conduct mean substitution or regression substitution. Mean substitution replaces the missing value with the mean of the variable. Regression substitution uses regression analysis to replace the missing value: regression analysis is designed to predict one variable based upon another variable, so it can be used to predict the missing value based upon the subject's answers to other variables. Both mean substitution and regression substitution can be found using Transform --> Replace Missing Values. The favored type of imputation is replacing the missing values using different estimation methods. The "Missing Values Analysis" add-on contains these estimation methods, but versions of SPSS without the add-on module do not. The estimation methods can be found by using Transform --> Replace Missing Values.
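To make the listwise/pairwise distinction concrete outside SPSS, here is a rough pandas sketch. The numbers are made up (not from our dataset); pandas' corr() happens to behave pairwise by default, and dropping incomplete cases first mimics listwise deletion.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "system1": [1, 2, 3, 4, np.nan, 6],
        "system2": [2, np.nan, 4, 5, 6, 7],
        "system3": [3, 4, np.nan, 6, 7, 8],
    })

    # Pairwise: each correlation uses every case that has both variables present,
    # so the effective sample size can differ from pair to pair.
    pairwise_r = df.corr()

    # Listwise: first drop any case with a missing value on ANY variable, so every
    # correlation is based on the same (smaller) set of complete cases.
    complete = df.dropna()
    listwise_r = complete.corr()

    print("non-missing values per variable:\n", df.notna().sum())
    print("pairwise correlations:\n", pairwise_r)
    print("complete cases used listwise:", len(complete))
    print("listwise correlations:\n", listwise_r)

Comparing the two correlation matrices (and their sample sizes) is exactly the exercise suggested in item c above.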
3. Outliers – Univariate

What are outliers? Outliers are extreme values compared to the rest of the data. The determination of values as "outliers" is subjective. While there are a few benchmarks for determining whether a value is an "outlier", those benchmarks are arbitrarily chosen, similar to how "p<.05" is arbitrarily chosen.

Should I check for outliers? Outliers can render your data non-normal. Since normality is one of the assumptions for many of the statistical tests you will conduct, finding and eliminating the influence of outliers may render your data normal, and thus appropriate for analysis using those statistical tests. However, I know no one who checks for outliers. Just because a value is extreme compared to the rest of the data does not necessarily mean it is somehow an anomaly, or invalid, or should be removed. The subject chose to respond with that value, so removing that value is arbitrarily throwing away data simply because it does not fit the "assumption" that data should be "normal". Conducting research is about discovering empirical reality. If the subject chose to respond with that value, then that data point is a reflection of reality, so removing the "outlier" is the antithesis of why you conduct research.

There is another (less theoretical, more practical) reason why I know no one who conducts outlier analysis. Outliers are usually found in many (many!) of the variables in every study. If you are going to check for outliers, then you have to check for outliers in all your variables (which could be 100+ in some surveys), and also check for outliers in the bivariate and multivariate relationships between your variables (1000+ in some surveys). Given the large number of outlier analyses you have to conduct in every study, you will invariably find outliers in EVERY STUDY. If you find and eliminate outliers in one of your published studies, then from an ethical and equity point of view, you should conduct the same outlier analysis in every study you analyze for the rest of your career. Many researchers do not want to undertake outlier analysis in every one of their studies because it is cumbersome and sometimes overwhelming. Plus, if outliers are valid data, then why conduct outlier analysis at all?

There is one more (less theoretical, more practical) reason why I know no one who conducts outlier analysis. It is common practice to use multiple questions to measure constructs because it increases the power of your statistical analysis. You typically create a "composite" score (the average of all the questions) when analyzing your data. For example, in a study about happiness, you may use an established happiness scale, or create your own happiness questions that measure all the facets of the happiness construct. When analyzing your data, you average together all the happiness questions into one happiness composite measure. While there may be some outliers in each individual question, averaging the items together reduces the probability of outliers because of the increased amount of data composited into the variable.

There is one last (less theoretical, more practical) reason why I know no one who conducts outlier analysis. If you decide to reduce the influence of the outlier, as described in the next section, you then re-run the outlier analysis after reducing the influence of the known outliers to determine whether the outlier has been eliminated.
Sometimes new outliers emerge, because they were masked by the old outliers and/or the data are now different after removing the old outliers, so existing extreme data points may now qualify as outliers. Once those outliers are removed, you re-run the outlier analysis again, and sometimes new outliers emerge again. It can become a cumbersome and sometimes overwhelming process with no end in sight. Plus, at what point, if any, should you draw the line and stop removing the newly emerging outliers?

There are two categories of outliers: univariate and multivariate outliers.

a. Univariate outliers are extreme values on a single variable. For example, if you have 10 survey questions in your study, then you would conduct 10 separate univariate outlier analyses, one for each variable. Also, when you average the 10 questions together into a new composite variable, you can conduct univariate outlier analysis on the new variable. Another way to conduct univariate analysis is to look at individual variables within different groups. For example, you would conduct univariate analysis on those same 10 survey questions within each gender (males and females), or within political groups (Republican, Democrat, other), etc. Or, if you are conducting an experiment with more than one condition, such as manipulating happiness and sadness in your study, then you would conduct univariate analysis on those same 10 survey questions within both groups.

b. The second category of outliers is multivariate outliers. Multivariate outliers are extreme combinations of scores on two or more variables. For example, if you are looking at the relationship between height and weight, then there may be a joint value that is extreme compared to the rest of the data, such as someone with extremely low height but high weight, or high height but low weight, and so forth.

You first look for univariate outliers, then proceed to look for multivariate outliers.

Univariate outliers:
1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the "Variable(s)" window.
3. Click "Statistics", and click "Outliers"
4. Click "Plots", and unclick "Stem-and-leaf"
5. Click OK.

The output below is for "system1".

The "Descriptives" box tells you descriptive statistics about the variable, including the values of skewness and kurtosis, with an accompanying standard error for each. This information will be useful later when we talk about "normality". The "5% Trimmed Mean" is the mean after removing the top and bottom 5% of scores. By comparing the "5% Trimmed Mean" to the mean, you can identify whether extreme scores (such as outliers that would be removed when trimming the top and bottom 5%) are influencing the variable.

"Extreme Values" and the boxplot relate to each other. The boxplot is a graphical display of the data that shows: (1) the median, which is the middle black line; (2) the middle 50% of scores, which is the shaded region; (3) the top and bottom 25% of scores, which are the lines extending out of the shaded region; (4) the smallest and largest (non-outlier) scores, which are the horizontal lines at the top/bottom of the boxplot; and (5) outliers. The boxplot shows both "mild" outliers and "extreme" outliers. Mild outliers are any scores more than 1.5*IQR from the rest of the scores, and are indicated by open dots. IQR stands for "interquartile range", the range spanned by the middle 50% of the scores. Extreme outliers are any scores more than 3*IQR from the rest of the scores, and are indicated by stars. However, keep in mind that these benchmarks are arbitrarily chosen, similar to how p<.05 is arbitrarily chosen.
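As a rough non-SPSS illustration of the 1.5*IQR and 3*IQR benchmarks, here is a small pandas sketch. The file name is hypothetical; "system1" is the variable from our dataset.

    import pandas as pd

    x = pd.read_csv("legal_beliefs.csv")["system1"].dropna()   # hypothetical file name

    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1                                   # interquartile range (middle 50% of scores)

    lower_mild, upper_mild = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_ext,  upper_ext  = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    extreme = x[(x < lower_ext) | (x > upper_ext)]
    mild = x[((x < lower_mild) | (x > upper_mild)) & (x >= lower_ext) & (x <= upper_ext)]

    print("mild outliers (beyond 1.5*IQR):", mild.tolist())     # the open dots
    print("extreme outliers (beyond 3*IQR):", extreme.tolist()) # the stars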
For "system1", there is an open dot. Notice that the dot is labeled "42", but, by looking at the "Extreme Values" box, there are actually FOUR lowest scores of "1", one of which is case 42. Since all four scores of "1" overlap each other, the boxplot can only display one case. In summary, this output tells us there are four outliers, each with a value of "1".

4. Outliers – Within Groups

Another way to look for univariate outliers is to conduct outlier analysis within the different groups in your study. For example, imagine a study that manipulated the presence or absence of a weapon during a crime, and the dependent variable measured the level of emotional reaction to the crime. In addition to looking for univariate outliers in your DV, you may also want to look for univariate outliers within each condition. In our dataset about "Legal Beliefs", let's treat gender as the grouping variable.
1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the "Variable(s)" window. Move "sex" into the "Factor List".
3. Click "Statistics", and click "Outliers"
4. Click "Plots", and unclick "Stem-and-leaf"
5. Click OK.

The output below is for "system1". The "Descriptives" box tells you descriptive statistics about the variable. Notice that the information for males and females is displayed separately. "Extreme Values" and the boxplot relate to each other. Notice the difference between males and females.

5. Outliers – Multivariate

Multivariate outliers are traditionally analyzed when conducting correlation and regression analysis. Multivariate outlier analysis is somewhat complex, so I will discuss how to identify multivariate outliers when we get to correlation and regression.

6. Outliers – dealing with outliers

First, we need to identify why the outlier(s) exist. It is possible the outlier is due to a data entry mistake, so you should first conduct the check described above in "1. Finding incorrectly entered data" to ensure that any outlier you find is not due to a data entry error. It is also possible that the subjects responded with the "outlier" value for a reason. For example, maybe the question is poorly worded or constructed. Or maybe the question is adequately constructed, but the subjects who responded with the outlier values are different from the subjects who did not respond with extreme scores. You can create a new variable that categorizes all the subjects as either "outlier subjects" or "non-outlier subjects", and then re-examine the data to see if there is a difference between these two types of subjects (a sketch of this idea follows below). Also, by looking at the subject numbers for the outliers displayed in all the boxplots, you may find that the same subjects are responsible for outliers on many questions in the survey. Remember, however, that just because a value is extreme compared to the rest of the data does not necessarily mean it is somehow an anomaly, or invalid, or should be removed.
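Here is a rough sketch of the "outlier subjects" versus "non-outlier subjects" idea, not part of the lab. The file name is hypothetical; system1, system2, system3, and sex are variables from our dataset, and the 1.5*IQR benchmark is reused only for illustration.

    import pandas as pd

    df = pd.read_csv("legal_beliefs.csv")   # hypothetical file name

    x = df["system1"]
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    # New variable: True for subjects whose system1 score is an outlier
    df["outlier_subject"] = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

    # Re-examine the data: do outlier subjects look different on other variables?
    print(df.groupby("outlier_subject")[["system2", "system3"]].mean())
    print(df.groupby("outlier_subject")["sex"].value_counts())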
Second, if you want to reduce the influence of the outliers, you have four options.

Option 1 is to delete the value. If you have only a few outliers, you may simply delete those values, so they become blank (missing) values.

Option 2 is to delete the variable. If you feel the question was poorly constructed, or if there are too many outliers in that variable, or if you do not need that variable, you can simply delete the variable. Also, if transforming the value or variable (Options #3 and #4) does not eliminate the problem, you may want to simply delete the variable.

Option 3 is to transform the value. You have a few options for transforming the value. You can change the value to the next highest/lowest (non-outlier) number. For example, if you have a 100-point scale, and you have two outliers (95 and 96), and the next highest (non-outlier) number is 89, then you could simply change the 95 and 96 to 89s. Alternatively, if the two outliers were 5 and 6, and the next lowest (non-outlier) number was 11, then the 5 and 6 would change to 11s. Another option is to change the value to the next highest/lowest (non-outlier) number PLUS one unit increment higher/lower. For example, the 95 and 96 would change to 90s (89 plus 1 unit higher), and the 5 and 6 would change to 10s (11 minus 1 unit lower).

Option 4 is to transform the variable. Instead of changing individual outliers (as in Option #3), we are now talking about transforming the entire variable. Transformation can make a distribution more normal, as described in the next section on "Normality". Since outliers are one cause of non-normality, see the next section to learn how to transform variables, and thus reduce the influence of outliers.

Third, after dealing with the outliers, you re-run the outlier analysis to determine whether any new outliers emerge or whether the data are outlier free. If new outliers emerge, and you want to reduce their influence, you choose one of the four options again. Then you re-run the outlier analysis to determine whether any new outliers emerge or whether the data are outlier free, and repeat again.

7. Normality

Below, I describe five steps for determining and dealing with normality. However, the bottom line is that almost no one checks their data for normality; instead, they assume normality and use the statistical tests that are based upon assumptions of normality, because those tests have more power (ability to find significant results in the data).

First, what is normality? A normal distribution is a symmetric, bell-shaped curve defined by two things: the mean (average) and the variance (variability).

Second, why is normality important? The central idea behind statistical inference is that as sample size increases, distributions will approximate the normal distribution. Most statistical tests rely upon the assumption that your data are "normal". Tests that rely upon the assumption of normality are called parametric tests. If your data are not normal, then you would use statistical tests that do not rely upon the assumption of normality, called non-parametric tests. Non-parametric tests are less powerful than parametric tests, which means the non-parametric tests have less ability to detect real differences or variability in your data. In other words, you want to conduct parametric tests because you want to increase your chances of finding significant results.

Third, how do you determine whether data are "normal"? There are three interrelated approaches to determining normality, and all three should be conducted.

First, look at a histogram with the normal curve superimposed. A histogram provides a useful graphical representation of the data. SPSS can also superimpose the theoretical "normal" distribution onto the histogram of your data so that you can compare your data to the normal curve.
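Outside SPSS, roughly the same picture can be drawn with matplotlib and scipy. This is only a sketch; the file name is hypothetical, and the bin count is an arbitrary choice.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    x = pd.read_csv("legal_beliefs.csv")["system1"].dropna()   # hypothetical file name

    plt.hist(x, bins=11, density=True, alpha=0.6)               # the observed distribution
    grid = np.linspace(x.min(), x.max(), 200)
    # Normal curve with the sample's own mean and standard deviation
    plt.plot(grid, stats.norm.pdf(grid, loc=x.mean(), scale=x.std()))
    plt.xlabel("system1")
    plt.show()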
To obtain a histogram with the superimposed normal curve in SPSS:
1. Select Analyze --> Descriptive Statistics --> Frequencies.
2. Move all variables into the "Variable(s)" window.
3. Click "Charts", and click "Histogram, with normal curve".
4. Click OK.

The output below is for "system1". Notice the bell-shaped black line superimposed on the distribution. All samples deviate somewhat from normal, so the question is: how much deviation from the black line indicates "non-normality"? Unfortunately, graphical representations like histograms provide no hard-and-fast rules. After you have viewed many (many!) histograms, over time you will get a sense for the normality of data. In my view, the histogram for "system1" shows a fairly normal distribution.

Second, look at the values of skewness and kurtosis. Skewness involves the symmetry of the distribution. A distribution with normal skewness is perfectly symmetric. A positively skewed distribution has scores clustered to the left, with the tail extending to the right. A negatively skewed distribution has scores clustered to the right, with the tail extending to the left. Kurtosis involves the peakedness of the distribution. A distribution with normal kurtosis is bell-shaped, neither too peaked nor too flat. Positive kurtosis is indicated by a peak; negative kurtosis is indicated by a flat distribution. Descriptive statistics for skewness and kurtosis can be obtained using the Frequencies, Descriptives, or Explore commands. I like to use the Explore command because it provides other useful information about normality, so:
1. Select Analyze --> Descriptive Statistics --> Explore.
2. Move all variables into the "Variable(s)" window.
3. Click "Plots", and unclick "Stem-and-leaf".
4. Click OK.

The "Descriptives" box tells you descriptive statistics about the variable, including the values of skewness and kurtosis, with an accompanying standard error for each. Both skewness and kurtosis are 0 in a normal distribution, so the farther away from 0, the more non-normal the distribution. The question is: how much skew or kurtosis renders the data non-normal? This is an arbitrary determination, and sometimes difficult to judge from the values of skewness and kurtosis alone. Luckily, there are more objective tests of normality, described next.

Third, the descriptive statistics for skewness and kurtosis are not as informative as established tests of normality that take both skewness and kurtosis into account simultaneously. The Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk (S-W) test are designed to test normality by comparing your data to a normal distribution with the same mean and standard deviation as your sample:
1. Select Analyze --> Descriptive Statistics --> Explore.
2. Move all variables into the "Variable(s)" window.
3. Click "Plots", unclick "Stem-and-leaf", and click "Normality plots with tests".
4. Click OK.

The "Test of Normality" box gives the K-S and S-W test results. If the test is NOT significant, then the data are normal, so any value above .05 indicates normality. If the test is significant (less than .05), then the data are non-normal. In this case, both tests indicate the data are non-normal. However, one limitation of the normality tests is that the larger the sample size, the more likely you are to get significant results; thus, you may get significant results with only slight deviations from normality. In this case, our sample size is large (n = 327), so the significance of the K-S and S-W tests may indicate only slight deviations from normality. You need to eyeball your data (using histograms) to determine for yourself whether the data rise to the level of non-normal.
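If you want to run comparable checks outside SPSS, a minimal scipy sketch looks like this. The file name is hypothetical, and note that SPSS's Explore version of the K-S test applies a significance correction (Lilliefors) that this plain scipy call does not, so the p-values will not match exactly.

    import pandas as pd
    from scipy import stats

    x = pd.read_csv("legal_beliefs.csv")["system1"].dropna()   # hypothetical file name

    print("skewness:", stats.skew(x))
    print("excess kurtosis:", stats.kurtosis(x))    # 0 for a normal distribution
    print("Shapiro-Wilk:", stats.shapiro(x))        # p < .05 suggests non-normality
    # Kolmogorov-Smirnov against a normal with the sample's own mean and SD
    print("K-S:", stats.kstest(x, "norm", args=(x.mean(), x.std())))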
The "Normal Q-Q Plot" provides a graphical way to determine the level of normality. The black line indicates the values your sample should adhere to if the distribution were normal. The dots are your actual data. If the dots fall exactly on the black line, then your data are normal. If they deviate from the black line, your data are non-normal. In this case, you can see substantial deviation from the straight black line.

Fourth, if your data are non-normal, what are your options for dealing with non-normality? You have four basic options.
a. Option 1 is to leave your data non-normal, and conduct the parametric tests that rely upon the assumption of normality. Non-normal data do not instantly invalidate the parametric tests. Normality (versus non-normality) is a matter of degree, not a strict cut-off point. Slight deviations from normality may render the parametric tests only slightly inaccurate. The issue is the degree to which the data are non-normal.
b. Option 2 is to leave your data non-normal, and conduct the non-parametric tests designed for non-normal data.
c. Option 3 is to conduct "robust" tests. There is a growing branch of statistics called "robust" tests that are just as powerful as parametric tests but account for non-normality of the data.
d. Option 4 is to transform the data. Transforming your data involves using mathematical formulas to modify the data toward normality.

Fifth, how do you transform your data into "normal" data? There are different types of transformations, based upon the type of non-normality. For example, see the handout "Figure 8.1" on the last page of this document, which shows six types of non-normality (3 positive skews that are moderate, substantial, and severe; 3 negative skews that are moderate, substantial, and severe). Figure 8.1 also shows the type of transformation for each type of non-normality. Transforming the data involves using the "Compute" function to create a new variable (the new variable is the old variable transformed by the mathematical formula):
1. Select Transform --> Compute Variable
2. Type the name of the new variable you want to create, such as "transform_system1".
3. Select the type of transformation from the "Functions" list, and double-click it.
4. Move the (non-normal) variable name into the place of the question mark "?".
5. Click OK.

The new variable appears in the last column of the "Data View". Now check whether the new variable is normal by using the tests described above. If the variable is normal, then you can start conducting statistical analyses of that variable. If the variable is non-normal, then try other transformations.
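For completeness, here is a small non-SPSS sketch of the same idea. The file name is hypothetical, the new variable name follows the "transform_system1" convention above, and the choice of square root versus log follows the usual moderate-versus-substantial positive-skew rule of thumb rather than anything specific to our data.

    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("legal_beliefs.csv")   # hypothetical file name

    # Square root for moderate positive skew; base-10 log for substantial positive skew;
    # for negative skew, reflect the scores first (12 = highest valid code 11, plus 1).
    df["transform_system1"] = np.sqrt(df["system1"])
    # df["transform_system1"] = np.log10(df["system1"])
    # df["transform_system1"] = np.sqrt(12 - df["system1"])

    # Re-check normality of the new variable
    x = df["transform_system1"].dropna()
    print("skewness after transform:", stats.skew(x))
    print("Shapiro-Wilk after transform:", stats.shapiro(x))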