Statistics – Spring 2008
Lab #2 – Descriptives

Descriptive analysis involves examining the characteristics of individual variables, whereas inferential statistics examines the relationships between variables. There are two types of variables -- categorical and continuous -- and the characteristics of interest differ for each type. For categorical variables, you are interested in the “count”, such as the demographic characteristics of your sample (e.g., 50 males and 56 females). For continuous variables, there are many different characteristics to examine, such as mean, median, mode, range, variability, etc., but the mean is typically the most useful descriptor.

This document explains how to examine characteristics of categorical and continuous variables. Descriptive analysis involves the same SPSS commands as Data Screening (e.g., Explore, Frequencies), so you are already familiar with how to conduct descriptive analysis.

This document also explains how to create composites by averaging individual variables together into new composite variables. Compositing can involve a few different tasks, such as reverse coding items, averaging items with different scale ranges, and conducting reliability analysis to determine whether, from a statistical point of view, the individual items “should” be averaged together. All those tasks are described below.

This document also explains new skills related to descriptive analysis that you may want to learn, such as how to transform a continuous variable into a categorical variable, how to create a new variable based upon the combination of two or more variables, and how to use syntax.

1. Descriptive Statistics

Your two options for descriptive analysis are the “Frequencies” command and the “Explore” command. Both provide much of the same information, except:
a. Frequencies -- groups the descriptive information into one grid, and displays histograms with a normal curve (whereas Explore displays histograms without normal curves)
b. Explore -- displays descriptive information for each variable separately, and displays boxplots

Frequencies:
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the “Variable(s)” window.
3. Click “Statistics” and put a checkmark next to every descriptive statistic you are interested in viewing.
4. Click “Charts” and put a checkmark next to the chart type you are interested in viewing.
5. Click OK.

Output below is for the first four “demographic” questions.

“Statistics” box provides a grid format of the descriptive statistics for each variable.

After the Statistics box, the frequency distribution for each variable is displayed. Below is the frequency distribution for gender:

Next come the histograms. Below are the histograms for age and gender. I chose to show you these two histograms because they illustrate how the “Frequencies” histogram is useful for displaying both categorical and continuous variables. Also, notice that neither histogram is normally distributed. Not every variable needs to be normally distributed. Plus, categorical variables with few answer choices (e.g., 2, 3, 4, 5, 6) will rarely conform to a normal curve. Finally, in the age histogram notice the sharp drop-off below the “20” line. This is because we restricted participation in the study to people who were aged 18 or older.

If you double-click on the histogram in the SPSS output viewer, it opens a new window containing the histogram with many new drop-down options to manipulate the histogram. There are too many options to explain them all, so feel free to try each one, and if you have specific questions, please let me know. However, one option I want to present to you is the ability to change the scale range on the histogram axis.
For example, if you double-click on the “age” histogram, it opens a new window. Then, double-click on the horizontal axis, which opens the “Properties” window. Then, click on “Scale”, set the scale range to run from 18 to 72 (the minimum and maximum in our sample), and change the increment value to 1. Click “Apply”. The new histogram for age is displayed below. Notice how much more information is displayed.

Now let’s use the Explore command:
1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the “Variable(s)” window.
3. Click “Plots” and unclick “Stem-and-leaf”
4. Click “Options” and click “Exclude cases pairwise”
5. Click OK.

Output below is for only the four “system” variables in our dataset because copy/pasting the output for all variables in our dataset would take up too much space in this document.

“Case Processing Summary” shows the number of cases that are valid, missing, and total.

“Descriptives” shows the same information as the “Frequencies” command, but now each variable is displayed separately.

Next, the boxplot for each variable is displayed. Below is the boxplot for “edu” because I want to show you a boxplot that contains both mild outliers (round dots) and extreme outliers (stars).

What if you want descriptive statistics within groups? For example, imagine a study that manipulated the presence or absence of a weapon during a crime, where the Dependent Variable (DV) measured the level of emotional reaction to the crime. In addition to looking at descriptive statistics for your DV within the entire study (so collapsing across both groups), you may also want descriptive statistics for your DV within each group. Another example of when you would want descriptive statistics within groups is when your study involves a verdict choice.
Typically, you not only report the percentage of guilty/not-guilty verdicts across the entire study, but you also want to report the percentage of guilty/not-guilty verdicts within each group in your study. I present an example of this situation below, and how to present this data in a Figure.

How to conduct descriptive statistics within groups:
In our dataset about “Legal Beliefs”, let’s treat gender as the grouping variable because sometimes you also want to present the gender split amongst your variables:
1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the “Variable(s)” window. Move “sex” into the “Factor List”
3. Click “Statistics”, and click “Outliers”
4. Click “Plots”, and unclick “Stem-and-leaf”
5. Click OK.

Output below is for “system1”.

“Descriptives” box tells you descriptive statistics about the variable. Notice that information for “males” and “females” is displayed separately.

WRITE-UP:
You typically discuss the demographic characteristics of the sample at the beginning of the Method section, not the Results section, and you typically only present data for gender (see below). If you want to discuss more than just gender, such as age, education, political affiliation, income, etc., then you would create a Figure to display all the data. For descriptive statistics other than demographics, you would present that data in the Results section. If there are only a few descriptive statistics, you discuss them in the text of the Results section (see below). If there are many descriptive statistics, you present them in a Figure, and then discuss only the most pertinent information from the Figure when you are writing the Results/Discussion section.
a. Here is a sample write-up for gender: “The sample consisted of 327 participants, with many more females (n = 248) than males (n = 76), and three participants who did not indicate gender.”
b. Here is a sample write-up for how you would discuss descriptive statistics in the Results section: “When asked what percentage of people brought to trial did in fact commit the crime, the average response was 78%.”
c. Here is a Figure (from another paper I wrote a few years ago): (FYI – see http://www.docstyles.com/apa15.htm for how to format Figures and Tables in APA format)

Verdict Choices for Death Qualified and Excludable Jurors

Verdict Choice    Witherspoon   Witherspoon   Witt         Witt
                  Excludable    Includable    Excludable   Includable
1. Guilty         29.2%         77.9%         31.2%        71.8%
2. Not Guilty     70.8%         22.1%         68.8%        28.2%

EVALUATION:
Since “evaluating” descriptive statistics in a Results section or Figure is simply reading the descriptive statistics that are reported, I don’t have any advice for evaluating descriptive statistics other than to notice whether any descriptive statistics were left out that you would find helpful or would want the author to include in the paper.

2. Other graphs

SPSS has the ability to create other types of graphs beyond histograms and boxplots, but they provide little information beyond what histograms and boxplots already provide. The other charts are:
a. Bar charts
b. 3D bar charts
c. Line charts
d. Area charts
e. Pie charts

To access these charts:
1. Select Graphs --> choose either “Chart Builder” or “Legacy Charts”
2. Move chosen variables into the appropriate open spaces
3. Click OK.

“Legacy Charts” are the old way that SPSS builds charts. Each chart has a separate command window, each with its own unique options and characteristics. The options and characteristics are very straightforward and easy to use. “Chart Builder” is new to SPSS. It reportedly has more functionality, but it is also complex and sometimes difficult to manipulate. I would suggest first using the Legacy Charts to get a better understanding of each type of chart.

3. Composites – averaging items together

Why do we create composites?
The rule of thumb in statistics is “the more, the better”. In terms of measuring constructs, this means that you typically want to ask many questions about the same construct in order to adequately tap into the entire construct of interest. For example, in a study about happiness, asking “how happy are you right now” perfectly maps onto the construct of “how happy you are right now”. But if your intended construct is “happiness”, you need to ask more questions to tap the entire theoretical construct, such as “how happy do you feel”, “how happy are you with your life in general”, etc. Thus, for every construct, researchers ask many questions by either using established scales of the topic, or creating their own measures to tap all the facets of the construct.

When you analyze the data, you start by conducting descriptive analysis of each individual question. Then, if you asked 10 questions, for example, you composite all 10 questions into 1 variable by averaging them together. Researchers are typically more interested in that 1 composite variable than the 10 individual items (unless the 10 questions are uniquely tapping different sub-parts of the entire construct, and the researchers are interested in each sub-part). So, after first conducting descriptive analysis of each item, you then conduct descriptive analysis of the 1 composite variable.

How do you create a composite?
1. Select Transform --> Compute Variable
2. Type a new name for your composite in the “Target Variable” box.
3. Drag “mean” from the “Function group” into the open box above
4. Replace the question marks (?) with each item to be composited, separated by a comma (,)
5. Click OK.

The newly created composite will appear at the end of the data file.

Is it appropriate to create a composite with my questions?
We described above how to create a composite, but another question is whether it’s appropriate to create the composite given the questions and data in your study.
You can answer that question from a theoretical point of view and from a statistical point of view. I describe both points of view below:

From a theoretical point of view…
a. From a theoretical point of view, it is possible your questions do not measure the same construct, and thus it is inappropriate to average them together. For example, the face content of each item may measure different concepts. Imagine questions about your political group orientation. A question about whether you “think” of yourself as a republican or democrat may tap a different construct than a question about whether you “feel” like a republican or democrat. You need to examine your questions and make a determination of whether you feel it’s appropriate to average the items together.
b. Another option is to create separate composites, one for each concept that is measured. For example, maybe you composite together all the questions about how you “feel” about your political group membership, and create another composite of the questions about how you “think” of your political group membership. After creating the separate composites, you can then also merge all the questions together (so merge all the separate composites together) into 1 big composite. In this case, you would call the separate composites you merged together the “sub-parts” or “sub-factors” of the 1 big composite. Also, from a theoretical point of view you need to decide how to label or characterize this big composite.
c. It is acceptable to create composites from a theoretical point of view even if it is not appropriate from a statistical point of view. I discuss next the benchmarks for deciding whether or not it’s statistically appropriate to merge items together into a composite, but even if those benchmarks are not met in your data, it can still be appropriate to merge items together from a purely theoretical point of view.
However, you must state in your paper that the statistical benchmarks were not met, and then explain the theoretical basis for why you are still merging the items together. (FYI – if the statistical benchmarks are met, then you rarely see researchers explain the theoretical basis for why the items were merged together.)

From a statistical point of view…
a. From a statistical point of view, it is possible your questions do not measure the same construct, and thus it is inappropriate to average them together. For example, you can use “Factor Analysis” to determine if the items fall into 1 big composite (called a “factor”), or if they fall into separate subfactors. I will explain Factor Analysis at the end of the semester, but only if you request it. Factor analysis is not one of the more typical statistical tests. Instead, researchers decide how the items group together from a theoretical point of view, and then proceed to test their judgment by conducting “Reliability Analysis”, which provides a benchmark for determining whether or not the items group together. In other words, Reliability Analysis is called a “confirmatory” test because it’s confirming your decisions, whereas Factor Analysis is considered an “exploratory” test because it is used to explore which, if any, of the items group together into which set of factors or sub-factors.
b. Reliability Analysis is rather straightforward to conduct:
1. Select Analyze --> Scale --> Reliability Analysis
2. Move all variables into the “Variable(s)” window.
3. Click “Statistics” and put a checkmark next to “Item” and “Scale if item deleted”
4. Click OK.

“Reliability Statistics” gives you the “Alpha” number, which is the determination of whether or not the items group together from a statistical point of view. Alpha ranges from 0 to 1, and the higher the number, the more strongly the items group together statistically. Output below is for the three “prosecutor” questions.
Alphas above .9 are great, above .8 are good, above .7 are ok, above .6 are borderline. In this case, Alpha = .68, which falls in the borderline range but is acceptable for merging the three items together into a composite. Also, the smaller the sample, the more likely you will find smaller Alpha levels because there is less data to identify intercorrelations. In smaller samples, smaller Alpha levels are acceptable to create composites.

The other output from the analysis is helpful to interpret your data. “Case Processing Summary” tells you the number of valid cases included in the analysis. Notice that only listwise deletion is possible. “Item Statistics” gives you descriptive information about each item. “Item-total Statistics” tells you the Alpha level if each item is removed. Notice that Alpha improves to .78 if we remove “prosecutor3”. In this case, because there are so few items (i.e., 3), I would suggest not removing “prosecutor3”, even though it improves Alpha, because only 2 items is not much of a composite. If we were analyzing many items (e.g., 4+), then it would be more appropriate to exclude items.

WRITE-UP:
“The three items measuring attitudes toward prosecutors formed a reliable composite (α = .68).”

EVALUATION:
For each composite in the paper, the author(s) need to report the alpha level, which is the statistic that tells you whether or not the items group together statistically. Alpha is determined by the strength of the bivariate relationships amongst all the items in the composite. The higher the internal consistency amongst items, the higher the Alpha level. Alphas above .9 are great, above .8 are good, above .7 are ok, above .6 are borderline. Also, the smaller the sample, the more likely you will find smaller Alpha levels because there is less data to identify intercorrelations.

4. Items with different scale ranges

If you are going to composite together multiple items, all the items need to have the same scale range.
For example, let’s say we ask two happiness questions: (1) “How happy are you right now?” on a 1-7 scale, and (2) “How happy do you feel?” on a -3 to 3 scale. Notice that the two questions are about the same construct (so theoretically you can merge them together), and also notice that the total range of the scale for both items is 7 points, BUT the two scales are numbered differently. Compositing involves averaging items together. If we average together these two items, the resulting average will not be interpretable because of the different scale ranges. For example, a “1” on the first item is the lowest possible answer choice, but a “1” on the second item is one of the highest possible choices. The solution is to transform both scale ranges into a common metric. This is accomplished by first “standardizing” both items. Then, we composite the newly transformed items.

Before we get to how to standardize items, I want to point out why I included in the example a scale that ranged from a negative number (-3) to a positive number (3). Sometimes when you are measuring constructs, there is a natural mid-point or neutral point, such as with “happiness” where you could have “0” happiness at the moment. In this situation, it can be beneficial to include an answer choice that is neutral or “0”. Notice that if we asked the same question but with a 1-7 scale, and you wanted to indicate you are feeling zero happiness at the moment, your only answer choice would be a “1”, which you may not feel indicates your absence of happiness. Another reason to include a scale that ranges from negative to positive is that your construct also ranges from negative to positive. For example, imagine a question that asked about your feelings about the death penalty. You could have a negative view or a positive view of the death penalty, so in order to tap that construct you need to include in the scale range answer choices that reflect positive and negative.
Another way to reflect both positive and negative in a scale is with the labels. For example, you could ask the same question about your feelings toward the death penalty on a 1-7 scale, but have the label for “1” be strongly oppose, for “4” be neutral, and for “7” be strongly support.

I also want to point out that standardizing your items to transform them to a common metric is necessary whenever any of the scale ranges differ, not just with negative versus positive items, as in the example above. For example, you may ask questions about the death penalty that are so similar that you want to vary the scale ranges of the items so that you tap into more information (and also force the subjects to pay more attention to the items, because giving all items the same scale range may allow lazy subjects to answer the same way on similar questions without thinking carefully about their answers).

To Standardize items:
1. Select Analyze --> Descriptive Statistics --> Descriptives
2. Move all variables into the “Variable(s)” window.
3. Put a checkmark next to “Save standardized values as variables”
4. Click OK.

The newly standardized variables are listed at the end of the data file. Each standardized variable is listed in a separate column. You can then analyze the new standardized variables as you would any other variable in your data set, including averaging them together to create a composite.

5. Reverse coding items

If you are going to composite together multiple items, all the items need to be “in the same direction”. This means that indicating a higher (or lower) response on each scale must correspond conceptually to answering higher (or lower) on the other items you want to composite together. For example, let’s say we ask two happiness questions: (1) “How happy are you right now?” on a 1-7 scale, and (2) “How unhappy are you right now?” on a 1-7 scale.
Notice that the two questions are about the same construct (so theoretically you can merge them together), and also notice that the total range of the scale for both items is 7 points, BUT that conceptually answering higher (or lower) on one item is the same as answering lower (or higher) on the other item. Before we can composite them together, we need to transform the items so that they are all “in the same direction”. Thus, we could reverse code the scale range for either the first item or the second item (but obviously not both).

Composites typically contain multiple items, so you typically have to reverse code multiple items. Also, when choosing which set of items to reverse code (e.g., either the items that are in the positive direction, or the items that are in the negative direction), you should think ahead to the statistical analyses you want to conduct and how you want the output from those statistical analyses (or the relationship between those variables) to be conceptualized. For example, imagine a study testing the relationship between happiness and income. If your hypothesis is that more income is correlated with more happiness, then conceptually we want our “happiness” composite to be coded in the positive direction (so that higher on the scale means more happiness) so that the outcome is easier to interpret. Notice that if we code the happiness composite in the opposite direction (so that lower means more happiness), we will still get the same conceptual outcome as with the positively coded composite -- that more happiness is correlated with more income -- but the interpretation of the outcome will be more difficult because we will get a negative correlation between the variables (because lower on the happiness scale is more happiness, and more happiness is correlated with higher income). Thus, think ahead to your intended results and code all the items in the appropriate direction.

To reverse code items:
1. Select Transform --> Recode into different variables
2. Move one item into the “Input Window”
3. Type a name for the new variable. (I like to use the same name as the original variable, but labeled with “_rev”, such as “system1_rev”)
4. Click “Change”
5. Click “Old and New Values”
6. Enter the “old” value and the “new” value and click “Add” (If reverse coding a 1-7 scale, then old=1, new=7; old=2, new=6; old=3, new=5, etc.)
7. Click Continue
8. Click OK.

The newly reverse coded variable is listed at the end of the data file. Notice that instead of “Recode into different variables”, there is an option for “Recode into same variables”. I do not use this function because I like to leave the old variable intact as a permanent record of each variable. Otherwise you may forget you reverse coded an item and reverse code it again, or you may make a mistake in reverse coding that can’t be undone because the old variable has been replaced.

6. SYNTAX

Up to this point, we have learned that SPSS has two windows – Data Editor (grid of data) and Viewer (output). SPSS has a third window – Syntax.

What is syntax?
When you point-and-click in the Data Editor for SPSS to calculate a mean, or outlier, or correlation, or whatever, SPSS is calculating the statistical formulas for those tests. SPSS is basically a big calculator that can perform many different calculations. When you point-and-click in the Data Editor for SPSS, you are telling SPSS how to perform those calculations, such as include “Kurtosis”, or “exclude cases pairwise”, or “run correlations on these three specific variables, and not the other variables”. Another way to tell SPSS to perform those same operations is to use programming language. In the syntax window, you can type programming language, then hit the “run” button, and SPSS will perform the calculations. This process is analogous to how a website designer writes computer code to design a website, but you don’t see the code, only the website design.
Similarly, the point-and-click functionality in SPSS is analogous to the website design you see, and the syntax functionality in SPSS is analogous to the background computer code that you typically don’t see.

Why use syntax?
The “point-and-click” interface is very easy to use. You don’t need to learn the syntax programming language, which can sometimes get overwhelming and difficult to understand. However, there are some advantages to syntax. For one, you can perform multiple operations more easily than with the point-and-click interface. For example, in the previous section about reverse coding items, you can only reverse code 1 item at a time. If you want to reverse code multiple items, you have to repeat the same steps over and over. Syntax makes that repetition less time-consuming. I present an example below about reverse coding, but I want to point out that you can use syntax for any point-and-click command. For every command in SPSS, instead of clicking “OK” as the last step, you can click “PASTE” as the last step, and it will display the syntax.

To reverse code items:
1. Select Transform --> Recode into different variables
2. Move one item into the “Input Window”
3. Type a name for the new variable. (I like to use the same name as the original variable, but labeled with “_rev”, such as “system1_rev”)
4. Click “Change”
5. Click “Old and New Values”
6. Enter the “old” value and the “new” value and click “Add” (If reverse coding a 1-7 scale, then old=1, new=7; old=2, new=6; old=3, new=5, etc.)
7. Click Continue
8. Click PASTE

Notice that the last action is “PASTE”, not OK. The syntax window will open, and the command you just initiated is displayed using syntax code. I have pasted below the syntax for our example. “RECODE” is the command to perform. Notice that the old variable name and new variable name are in the command line. Notice that it ends with “EXECUTE.”.
If we wanted to “run” this command, we would highlight the entire syntax, and click the arrow button: ►

RECODE system1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system1_rev.
EXECUTE.

We are using this example to show how using syntax can speed up repetitive actions. If we copy/paste the syntax over and over, we can then type in the other variables we need to reverse code. Then, we highlight all the syntax, and click the arrow button to run the syntax.

RECODE system1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system1_rev.
EXECUTE.
RECODE system2 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system2_rev.
EXECUTE.
RECODE system3 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system3_rev.
EXECUTE.
RECODE system4 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system4_rev.
EXECUTE.

Another way to use syntax is to keep a record of your statistical analyses, because the syntax indicates not only which statistical analyses were performed, but also how you performed those statistical analyses and which options you chose to use. The Output window provides that record by displaying the syntax for every analysis that is conducted.

7. Transforming continuous variables into categorical variables (and categorical variables into different categorical variables)

It is possible to transform continuous variables into categorical variables. For example, imagine a study about happiness where your happiness item (or composite) ranges from 1 to 7. You might be interested in categorizing the subjects as either high happiness (4 through 7 on the scale) or low happiness (1 through 4 on the scale). This is called “dichotomizing” the variable because you are creating a new variable that has only two options. Another example of why you would want to transform a continuous variable into a categorical variable is if there are only a few responses on some of the answer choices in the continuous variable.
For example, imagine a scale range from 1-11 in which answer choice 4 and/or answer choice 9 received only 1 response each. 1 response is not enough data for meaningful interpretation. You may want to collapse the 11 point scale into 3 or 4 categories. As another example, look at “rel_category” in our dataset, which measures the religious category memberships of the subjects. The frequency distribution is listed below. Hindu received only 6 responses, and Jewish received only 9 responses. You may want to merge those responses into “other” and/or merge all the data into “Christian” versus “other”. Notice that creating the new categorical variable is answering a different research question than the original categorical variable.

Transforming variables in this way uses the same SPSS command as for reverse coding items. To transform the variables:
1. Select Transform --> Recode into different variables
2. Move one item into the “Input Window”
3. Type a name for the new variable. (I like to use the same name as the original variable, but labeled with “_cat”, such as “system1_cat”)
4. Click “Change”
5. Click “Old and New Values”
6. Click “Range” and enter the range of values of the “old” variable, and assign a number for the new variable. (e.g., 1-3.999 becomes a “1”, and 4.001-7 becomes a “2”)
7. Click Continue
8. Click OK.

The newly transformed variable is listed at the end of the data file. I would suggest then going into the “Variable View” and assigning value labels in the “Values” column that reflect how you cut the variable. For example, if you just created a new categorical variable where 1-3.999 becomes a “1”, and 4.001-7 becomes a “2”, then assign 1=1-3.999, and 2=4.001-7. Thus, you keep a record of what the “1” and “2” mean.

How do I decide where to split up the variable?
This is a complex question with a complex answer:

If you are dichotomizing a variable, you can split at the midpoint of the scale from a theoretical point of view because that is conceptually the middle response. Plus, sometimes you choose to use an odd scale range because that is designed to have a true mid-point. However, what if in your dataset there are more subjects in the high or low end of the scale? Splitting at the mid-point of the scale might create a vastly unequal distribution when you dichotomize the variable. What if, for example, splitting at the midpoint of the scale puts 70-80% of the subjects in one end, and 10-20% in the other? You are already losing valuable information by reducing from a continuous variable to a categorical variable, and if you have unbalanced categories, you are losing even more information.

In this case, you could choose to split at the median, even if the median is not the midpoint of the scale. From a theoretical point of view, the median is a good choice for splitting the variable because it is the mid-point of that sample. Samples are not always normally distributed. Research is about discovering empirical reality, so sometimes reality dictates how subjects respond to the question, and maybe assuming the midpoint of the scale is the true midpoint of the construct is inaccurate. Plus, from a statistical point of view, the median truly splits the sample into halves. However, what if in your dataset the median is a very high or low number on the scale range? For example, what if on a 1-7 point scale, the median is a 2 or a 6? In this situation half of the scores are bunched into a small range (e.g., 2 points in this example), whereas the other half are more evenly distributed across a larger range (e.g., 5 points in this example). Once again, you are losing valuable information by dichotomizing in this way.

In summary, both theoretical and statistical considerations come into play when dichotomizing variables.
One solution is to dichotomize in both ways and analyze the data using both variables.

The same theoretical and practical considerations come into play when you are deciding to split the variables in other ways. You may decide, for example, to cut the continuous variable into thirds, fourths, or fifths. Sometimes when you cut the variable into thirds, your new categorical variable only includes the top and bottom third. Sometimes you are only interested in the more polarized decisions. Sometimes you can strengthen the relationship between your variables by only including the polarized judgments. From a theoretical point of view it can make sense to drop the middle third because they are the subjects who are somewhat undecided about the construct.

Plus, think about why dichotomizing continuous variables results in reduced information and reduced statistical power. Subjects in the continuous variable who are near the middle are now the same as subjects near the top/bottom after you dichotomize the variable. On a 100-point scale, for example, the subjects who respond 49 and 51 are treated the same as the subjects who respond 0 and 100, respectively. Thus, you are reducing your ability to detect true relationships in the study because the subjects close to the middle may be masking relationships amongst your variables by diluting the strength of the high/low categories in the variable. Eliminating the middle third when you cut the continuous variable into thirds is one way to create a categorical variable while minimizing your loss of power.

From a practical point of view, if you are dichotomizing a variable, you don’t truly cut it in half because if you cut a 1-7 point scale into 1-4 and 4-7, for example, a subject who answered “4” is technically in both categories. Thus, you typically create a small degree of separation between the ranges, such as 1-3.999 and 4.001-7.
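The range-based recoding described above can be sketched in Python to show what the small gap between ranges does. This is not SPSS syntax. The `recode` helper and the scores are hypothetical, and the "outer thirds" cut points (3 and 5) assume thirds of the 1-7 scale range rather than thirds of the sample.

```python
def recode(value, ranges):
    """Return the category whose (low, high) range contains value, else None."""
    for low, high, category in ranges:
        if low <= value <= high:
            return category
    return None  # the value falls in no range, so it gets no category

# The dichotomy from the text: 1-3.999 becomes 1, and 4.001-7 becomes 2.
dichotomy = [(1, 3.999, 1), (4.001, 7, 2)]

# Assumed cut points: keep only the bottom and top third of the 1-7 range.
outer_thirds = [(1, 3, 1), (5, 7, 2)]

# Made-up composite scores (e.g., averages of several 1-7 items).
scores = [1.5, 3.75, 4.0, 4.25, 6.5]

dichotomized = [recode(s, dichotomy) for s in scores]
polarized = [recode(s, outer_thirds) for s in scores]

print(dichotomized)  # [1, 1, None, 2, 2]
print(polarized)     # [1, None, None, None, 2]
```

In this sketch a score of exactly 4.0 lands in the gap between the two ranges and receives no category, which is worth keeping in mind when choosing where to place the separation.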
When splitting a continuous variable into two groups, another question is whether you want to have equal N size for just that variable, or have equal N across that variable AND another variable. For example, I conducted a study about how Republicans and Democrats identify with their political party. Let’s say I want to dichotomize my measure of “identification”. The question is whether I want equal N size for just the identification variable, or equal N across both identification and the Republican v. Democrat variable. If you split the identification variable down the middle, you might have many more Republicans in the low or high identification condition, and vice versa for Democrats. On the other hand, you could split the identification variable separately for Republicans and then again for Democrats, and then combine the results, so that you have equal N across both variables. I believe both are defensible options. My opinion is that the first option is the best (grand median or midpoint) because then the high and low groups will have equivalent psychological meaning across party affiliation. In other words, “high” and “low” mean the same thing for both Republicans and Democrats even if cell size is unequal.

6. Creating new variables based upon the combination of two or more variables

Sometimes you want to create a new variable that is a combination of two or more other variables. For example, in my study about how Republicans and Democrats identify with their political party, I asked each subject what their political party affiliation is and how much they identify with that political party. Let’s say I want a new variable of only highly identified Republicans but lowly identified Democrats. In this case I want to create a new variable that is a combination of my two questions.
Here is another example: Assume that when I asked my first question about political party affiliation, there were four options -- Republican, Democrat, other, none. If I wanted to create a new variable of only highly identified Republicans and Democrats, I can’t simply cut the “identification” question in half, because the top half will contain more than just Democrats and Republicans; it will also contain those who responded “none” or “other”. In this situation, I need a way to create a new variable that takes into account different option choices.

How do I create a new variable based upon the combination of two or more variables?

Below, I explain the steps for using the “Compute variable” command. However, I want to first explain conceptually what the task entails. In essence, we are going to tell SPSS to create a new variable that is labeled as “1” if a subject satisfies certain criteria (such as high on variable 1, but low on variable 2), and labeled as “2” if a subject satisfies other criteria (such as high on variable 2, but low on variable 1). In other words, we can specify a long combination of criteria, and have subjects who meet those criteria labeled as 1 or 2 (or 3 or 4, depending on how many categories you want in your new variable). As an example, we could create a new variable that has subjects listed as “1” if they are Republican and high identifiers, and subjects listed as “2” if they are Democrat and high identifiers. Thus, the new categorical variable will have two categories that are a combination of my two questions about political party affiliation and identification with that political party.

You create each new category separately. Thus, in our example about creating a new variable that contains only highly identified Republicans and highly identified Democrats, we first use the “Compute variable” command to assign a “1” to highly identified Republicans.
Then, we repeat the process by using the “Compute variable” command to assign a “2” to highly identified Democrats.

To transform the variables:
1. Select Transform --> Compute Variable
2. Type a name for your new variable in the “Target Variable” box.
3. In the “Numeric Expression” box, type the number of a category (e.g., let’s start by assigning category “1”).
4. Click the “If” button, and click “Include if case satisfies condition”.
5. Move the old variable into the open box, and specify the restriction. For example, if identification is the “identify” variable and political party affiliation is the “party” variable, then I need to specify only those subjects who are highly identified (e.g., greater than 4 on the “identify” variable) and who are simultaneously Republicans (e.g., Republicans are labeled “1” on the “party” variable). So, I would type the following into the box -- identify>4 & party=1
6. Click Continue
7. Click OK.

THEN WE REPEAT TO CREATE THE SECOND CATEGORY
1. Select Transform --> Compute Variable
2. Type the SAME name in the “Target Variable” box as you did the first time.
3. In the “Numeric Expression” box, type the number of a category (e.g., “2”).
4. Click the “If” button, and click “Include if case satisfies condition”.
5. Move the old variable into the open box, and specify the restriction. For example, if identification is the “identify” variable and political party affiliation is the “party” variable, then I need to specify only those subjects who are highly identified (e.g., greater than 4 on the “identify” variable) and who are simultaneously DEMOCRATS (e.g., Democrats are labeled “2” on the “party” variable). So, I would type the following into the box -- identify>4 & party=2
6. Click Continue
7. Click OK.

To summarize, the “Numeric Expression” box holds the number we want to assign to the new category (1 or 2). And, the criteria for who is assigned that number are specified in the “If” box.
And, the name of the new categorical variable is typed in the “Target Variable” box.

The new variable will appear at the end of the data file.
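The logic of the two Compute passes can also be written as a short Python sketch. It is an illustration of the conditions only, not SPSS syntax. The `combine` helper and the subjects are hypothetical. Party codes 1 and 2 follow the example above, while codes 3 (other) and 4 (none) are assumed here.

```python
def combine(identify, party):
    """Assign a category from two variables; None when no condition is met."""
    if identify > 4 and party == 1:   # highly identified Republican
        return 1
    if identify > 4 and party == 2:   # highly identified Democrat
        return 2
    return None                       # everyone else gets no category

# Made-up subjects as (identify score on a 1-7 scale, party code).
subjects = [(6, 1), (5, 2), (3, 1), (7, 3), (2, 2), (6, 4)]

new_variable = [combine(identify, party) for identify, party in subjects]
print(new_variable)  # [1, 2, None, None, None, None]
```

Subjects who are low identifiers, or who answered “other” or “none”, match neither condition and so receive no value on the new variable.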