Crosstabulation and Measures of Association: Investigating the Relationship between Two Variables

Generally, a statistical relationship exists if the values of the observations for one variable are associated with the values of the observations for another variable. Knowing that two variables are related allows us to make predictions: if we know the value of one, we can predict the value of the other. Determining how the values of one variable are related to the values of another is one of the foundations of empirical science. In making such determinations we must consider the following features of the relationship:

1.) The level of measurement of the variables. Different levels of measurement necessitate different procedures.
2.) The form of the relationship. We can ask whether changes in X move in lockstep with changes in Y or whether a more sophisticated relationship exists.
3.) The strength of the relationship. Is it possible that some levels of X will always be associated with certain levels of Y?
4.) Numerical summaries of the relationship. Social scientists strive to boil down the different aspects of a relationship to a single number that reveals the type and strength of the association.
5.) Conditional relationships. The variables X and Y may seem to be related in some fashion, but appearances can be deceiving (spuriousness, for example), so we need to know whether introducing other variables into the analysis changes the relationship.

Types of Association
1.) General association – the variables are simply associated in some way.
2.) Positive monotonic correlation – when the variables have order (ordinal or continuous), high values of one variable are associated with high values of the other, and low values with low values.
3.) Negative monotonic correlation – low values of one variable are associated with high values of the other.
4.) Positive linear association – a particular type of positive monotonic relationship in which the plotted X-Y values fall on a straight line that slopes upward.
5.) Negative linear association – the plotted X-Y values fall on a straight line that slopes downward.

Strength of Relationships
Virtually no relationships between variables in social science (and largely in natural science as well) have a perfect form. As a result, it makes sense to talk about the strength of relationships. The strength of a relationship between variables can be assessed by simply looking at a graph of the data. If the values of X and Y are tied together tightly, the relationship is strong; if the X-Y points are spread out, the relationship is weak.

Direction of Relationship
We can also infer direction from a graph by observing how the values of our variables move across the graph. This is only possible, however, when our variables are ordinal or continuous.

Types of Bivariate Relationships and Associated Statistics

  Levels of measurement                        Appropriate technique
  Nominal/Ordinal (including dichotomous)      Crosstabulation (Lambda, Chi-Square, Gamma, etc.)
  Interval and Dichotomous                     Difference of means test
  Interval and Nominal/Ordinal                 Analysis of variance
  Interval and Interval/Ratio                  Regression and correlation

Assessing Relationships between Variables
1. Calculate the appropriate statistic to measure the magnitude of the relationship in the sample.
2. Calculate additional statistics to determine whether the relationship holds for the population of interest (statistical significance). Keep in mind the distinction between substantive significance and statistical significance.

What is a Crosstabulation?
Crosstabulations are appropriate for examining relationships between variables that are nominal, ordinal, or dichotomous.
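As a practical aside before the worked example that follows: tables like these are easy to build in software. Here is a minimal sketch using pandas; the DataFrame and its toy data are hypothetical, invented purely for illustration.

```python
import pandas as pd

# Hypothetical survey data: each row is one respondent.
df = pd.DataFrame({
    "race": ["White", "Black", "White", "Hispanic", "Black", "White"],
    "vote": ["Bush", "Gore", "Bush", "Gore", "Gore", "Gore"],
})

# Joint distribution: one row per vote choice, one column per race,
# with marginal totals added.
counts = pd.crosstab(df["vote"], df["race"], margins=True)

# Column percentages (percent within each race category), the form
# used in the vote-choice table in the next section.
col_pct = pd.crosstab(df["vote"], df["race"], normalize="columns") * 100

print(counts)
print(col_pct.round(1))
```

Percentaging within the categories of the independent variable (here, columns) is what makes the comparison across groups meaningful.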
Crosstabs show the values of one variable categorized by another variable. They display the joint distribution of the variables by listing the categories of one along the x-axis and the other along the y-axis. Each case is then placed in the cell of the table that represents the combination of values corresponding to its scores on the two variables.

What is a Crosstabulation? Example
We would like to know whether presidential vote choice in 2000 was related to race.
Vote choice = Gore or Bush
Race = White, Hispanic, Black

Are Race and Vote Choice Related? Why?

           Black   Hispanic   White   TOTAL
  Gore      106       23       427     556
  Bush        8       15       484     507
  TOTAL     114       38       911    1063

The same table with column percentages:

           Black         Hispanic     White          TOTAL
  Gore     106 (93.0%)   23 (60.5%)   427 (46.9%)    556 (52.3%)
  Bush       8 (7.0%)    15 (39.5%)   484 (53.1%)    507 (47.7%)
  TOTAL    114 (100%)    38 (100%)    911 (100%)    1063 (100%)

Measures of Association for Crosstabulations
Purpose – to determine whether nominal/ordinal variables are related in a crosstabulation.
At least one nominal variable: Lambda, Chi-Square, Cramer's V
Two ordinal variables: Tau, Gamma
These measures of association provide us with coefficients that summarize the data from a table in a single number. This is extremely useful when dealing with several tables or very complex tables. These coefficients measure both the strength and the direction of an association.

Coefficients for Nominal Data
When one or both of the variables are nominal, ordinal coefficients cannot be used because there is no underlying ordering. Instead we use PRE measures.

Lambda (a PRE coefficient)
PRE – Proportional Reduction in Error. Two rules:
1.) Make a prediction of the value of an observation in the absence of any prior information.
2.) Given information on a second variable, take it into account in making the prediction.
If the two variables are associated, then using rule two should lead to fewer errors in your predictions than rule one. How many fewer errors depends on how closely the variables are associated.
PRE = (E1 – E2) / E1, where E1 is the number of errors under rule one and E2 the number under rule two. The scale goes from 0 to 1.

Lambda
Lambda is a PRE coefficient, and it relies on rules 1 and 2 above. When applying rule one, all we have to go on is what proportion of the population fits into one category as opposed to another. So, without any other information, guessing that every observation is in the modal category gives you the best chance of getting the most correct. Why? Think of it like this: if you knew that I tended to write exams on which the most frequently used answer was B, then, without any other information, you would be best served to pick B every time. But if you know each case's value on another variable, rule two directs you to look only at the members of each category of that variable and find the modal category of the dependent variable within it.

Example
Suppose we have a sample of 100 voters and need to predict how they will vote in the general election. Assume we know that overall 30% voted Democrat, 30% voted Republican, and 40% voted independent. Now suppose we take one person out of the group (John Smith); our best guess would be that he voted independent. Now suppose we take another person (Larry Mendez); again we would guess independent. As a result, our best strategy is to predict that all 100 voters voted independent. We are sure to get some wrong, but it is the best we can do over the long run. How many do we get wrong? 60.
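Before bringing in the second variable (next), the rule-one logic just described can be made concrete in a few lines of Python. This is a minimal sketch using only the 30/30/40 split given in the example; the variable names are my own.

```python
from collections import Counter

# Observed votes for 100 voters: 30 Democrat, 30 Republican, 40 independent.
votes = ["dem"] * 30 + ["rep"] * 30 + ["ind"] * 40

# Rule 1: with no other information, always guess the modal category.
counts = Counter(votes)
modal_category, modal_count = counts.most_common(1)[0]

# Every case not in the modal category is a prediction error.
errors_rule1 = len(votes) - modal_count
print(modal_category, errors_rule1)  # "ind", 60 errors
```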
Suppose now that we know something about the voters' regions (where they are from) and what proportion of each region voted for each party. The regional totals are NE – 30, MW – 20, SO – 30, WE – 20.

Lambda

          NE   MW   SO   WE   TOTAL
  REPUB    4   10    6   10      30
  IND     12    8   16    4      40
  DEM     14    2    8    6      30
  TOTAL   30   20   30   20     100

Lambda – Rule 1 (prediction based solely on knowledge of the marginal distribution of the dependent variable, partisanship): the modal category overall is independent (40 of 100), so we predict "independent" for every case.

Lambda – Rule 2 (prediction based on knowledge provided by the independent variable, region): within each region we predict that region's modal party – DEM in the NE (14 of 30), REPUB in the MW (10 of 20), IND in the SO (16 of 30), and REPUB in the WE (10 of 20).

Lambda – Calculation of Errors
Errors with Rule 1 (by region): 18 + 12 + 14 + 16 = 60
Errors with Rule 2 (by region): 16 + 10 + 14 + 10 = 50
Lambda = (Errors R1 – Errors R2) / Errors R1 = (60 – 50)/60 = 10/60 = .17

Lambda is a PRE measure and ranges from 0 to 1. Potential problems with Lambda:
1.) It underestimates the relationship when one or both variables are highly skewed.
2.) It is always 0 when the modal category of Y is the same across all categories of X.

Chi-Square (χ²)
Chi-square is also appropriate for any crosstabulation with at least one nominal variable (and another nominal/ordinal variable). It is based on the difference between the empirically observed crosstab and what we would expect to observe if the two variables were statistically independent.

Background for χ²
Statistical independence – a property of two variables in which the probability that an observation is in a particular category of one variable and also in a particular category of the other variable equals the product of the marginal probabilities of being in those categories. It plays a large role in data analysis and is another way to view the strength of a relationship.

Example
Suppose we have two nominal (categorical) variables, X and Y. Label the categories of the first (a, b, c) and those of the second (r, s, t). Let P(X = a) stand for the probability that a randomly selected case has property a on variable X, and P(Y = r) stand for the probability that a randomly selected case has property r on variable Y. These probabilities are called marginal probabilities and simply refer to the chance that an observation has a particular value on a particular variable irrespective of its value on the other variable. Finally, let P(X = a, Y = r) stand for the joint probability that a randomly selected observation has both property a and property r simultaneously.

The two variables are statistically independent only if the chance of observing a combination of categories equals the marginal probability of one category times the marginal probability of the other:

  P(X = a, Y = r) = P(X = a) × P(Y = r)

For example, if men are as likely to vote as women, then the two variables (gender and voter turnout) are statistically independent, because the probability of observing a male nonvoter in the sample is equal to the probability of observing a male times the probability of observing a nonvoter.
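Before working through the chi-square numbers, here is a sketch that completes the lambda example in code, using the cell counts from the region-by-party table above. The function is a hypothetical helper written for this illustration, not from any particular library.

```python
# Region-by-party counts from the table above: for each region,
# a dict of {party: count} within that region.
table = {
    "NE": {"rep": 4,  "ind": 12, "dem": 14},
    "MW": {"rep": 10, "ind": 8,  "dem": 2},
    "SO": {"rep": 6,  "ind": 16, "dem": 8},
    "WE": {"rep": 10, "ind": 4,  "dem": 6},
}

def lambda_coefficient(table):
    n = sum(sum(row.values()) for row in table.values())
    # Rule 1: guess the modal category of the dependent variable overall.
    totals = {}
    for row in table.values():
        for cat, count in row.items():
            totals[cat] = totals.get(cat, 0) + count
    errors_rule1 = n - max(totals.values())
    # Rule 2: within each category of the independent variable,
    # guess that category's modal value of the dependent variable.
    errors_rule2 = sum(sum(row.values()) - max(row.values())
                       for row in table.values())
    return (errors_rule1 - errors_rule2) / errors_rule1

print(round(lambda_coefficient(table), 2))  # 0.17, matching (60 - 50) / 60
```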
Example
Suppose 100 of 300 respondents are men and 210 of 300 voted. The marginal probabilities are P(X = m) = 100/300 = .33 and P(Y = v) = 210/300 = .70, so under independence the joint probability of being a male voter is .33 × .70 = .23. If we know that 70 of the voters are male, the observed joint proportion is 70/300 = .23 as well. We can therefore say that the two variables are independent.

The chi-square statistic essentially compares an observed result (the table produced by the sample) with a hypothetical table that would occur if, in the population, the variables were statistically independent. A value of 0 implies statistical independence, which means no association. Chi-square increases as the departure of the observed from the expected values grows. There is no upper limit to how big the difference can become, but if it exceeds a critical value, there is reason to reject the null hypothesis that the two variables are independent.

How Do We Calculate χ²?
The observed frequencies are already in the crosstab. The expected frequency in each cell is found by multiplying the row and column marginal totals and dividing by the sample size.

Chi-Square (χ²) – observed (O) and expected (E) frequencies

           NE         MW         SO         WE        TOTAL
           O    E     O    E     O    E     O    E
  REPUB    4    9    10    6     6    9    10    6       30
  IND     12   12     8    8    16   12     4    8       40
  DEM     14    9     2    6     8    9     6    6       30
  TOTAL   30         20         30         20           100

Calculating Expected Frequencies
To calculate the expected cell frequency for NE Republicans: E/30 = 30/100, therefore E = (30 × 30)/100 = 9. The value 9 is the expected frequency in the first cell of the table: it is what we would expect in a sample of 100 (with 30 Republicans and 30 northeasterners) if there is statistical independence in the population. Our sample contains fewer (4 observed), so there is a difference.

Calculating the Chi-Square Statistic
The chi-square statistic is calculated as:

  χ² = Σ (Oik – Eik)² / Eik

  χ² = 25/9 + 16/6 + 9/9 + 16/6 + 0 + 0 + 16/12 + 16/8 + 25/9 + 16/6 + 1/9 + 0 = 18

Just Like a Hypothesis Test
Null: statistical independence between X and Y.
Alternative: X and Y are not independent.
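A quick way to check the hand calculation above is scipy's contingency-table routine, assuming scipy is available; the observed matrix below is the region-by-party table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows REPUB, IND, DEM; columns NE, MW, SO, WE.
observed = np.array([
    [4, 10, 6, 10],
    [12, 8, 16, 4],
    [14, 2, 8, 6],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(expected[0, 0])  # 9.0: expected NE Republicans, (30 * 30) / 100
print(chi2, dof)       # 18.0 with (3 - 1) * (4 - 1) = 6 degrees of freedom
print(p_value)         # probability of a chi-square this large under independence
```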
Interpreting the Chi-Square Statistic
The chi-square statistic ranges from 0 to infinity; 0 = perfect statistical independence. Even though two variables may be statistically independent in the population, in a sample the chi-square statistic may be greater than 0. Therefore it is necessary to determine the statistical significance of a chi-square statistic (given a certain level of confidence).

Cramer's V
Problem with chi-square: it is not comparable across different sample sizes (and their associated crosstabs). Cramer's V is a standardization of the chi-square statistic:

  V = sqrt( χ² / (N × min(R − 1, C − 1)) )

where R = number of rows and C = number of columns. V ranges from 0 to 1.

Example (region and partisanship): V = sqrt( 18 / (100 × (3 − 1)) ) = sqrt(.09) = .30

Relationships between Ordinal Variables
There are several measures of association appropriate for relationships between ordinal variables. Gamma, Tau-b, Tau-c, and Somers' d are all based on identifying concordant, discordant, and tied pairs of observations.

Concordant Pairs: Ideology and Voting
Ideology – conservative (1), moderate (2), liberal (3)
Voting – never (1), sometimes (2), often (3)
Consider two hypothetical individuals in the sample with these scores:
Individual A: Ideology = 1, Voting = 1
Individual B: Ideology = 2, Voting = 2
A and B are a concordant pair because B's ideology score is greater than A's and B's voting score is greater than A's.

Concordant Pairs (cont'd)
All of the following are concordant pairs (scores written as (Ideology, Voting)):
A(1,1) B(2,2); A(1,1) B(2,3); A(1,1) B(3,2); A(1,2) B(2,3); A(2,2) B(3,3)
Concordant pairs are consistent with a positive relationship between the IV and the DV (ideology and voting).

Discordant Pairs
All of the following are discordant pairs:
A(1,2) B(2,1); A(1,3) B(2,2); A(2,2) B(3,1); A(1,2) B(3,1); A(3,1) B(1,2)
Discordant pairs are consistent with a negative relationship between the IV and the DV (ideology and voting).

Identifying Concordant and Discordant Pairs

                 Conservative (1)   Moderate (2)   Liberal (3)
  Never (1)            80               10             10
  Sometimes (2)        20               70             10
  Often (3)             0               20             80

Concordant pairs for the Never–Conservative cell (1,1): #Concordant = 80×70 + 80×10 + 80×20 + 80×80 = 14,400
Concordant pairs for the Never–Moderate cell (2,1): #Concordant = 10×10 + 10×80 = 900
Discordant pairs for the Often–Conservative cell (1,3): #Discordant = 0×10 + 0×10 + 0×70 + 0×10 = 0
Discordant pairs for the Often–Moderate cell (2,3): #Discordant = 20×10 + 20×10 = 400

Gamma
Gamma is calculated by identifying all possible pairs of individuals in the sample and determining whether each is concordant or discordant:

  Gamma = (#C − #D) / (#C + #D)

Interpreting Gamma
For the table above, Gamma = 21,400/24,400 = .88. Gamma ranges from −1 to +1. Gamma does not account for tied pairs; Tau (b and c) and Somers' d account for tied pairs in different ways (Tau-b is used for square tables, Tau-c for non-square tables).

Example
2004 NES – What explains variation in one's political ideology? Income? Education? Religion? Race?

Bivariate Relationships and Hypothesis Testing (Significance Testing)
1. Determine the null and alternative hypotheses.
   Null: There is no relationship between X and Y (X and Y are statistically independent and the test statistic = 0).
   Alternative: There IS a relationship between X and Y (the test statistic does not equal 0).
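Before continuing with the remaining hypothesis-testing steps, here is a sketch of the concordant/discordant counting behind gamma, applied to the ideology-by-voting table above. The brute-force quadruple loop is my own illustration, not an efficient or library implementation.

```python
# Voting (rows: never, sometimes, often) by ideology
# (columns: conservative, moderate, liberal), from the table above.
table = [
    [80, 10, 10],
    [20, 70, 10],
    [0, 20, 80],
]

def gamma(table):
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    # Pair every cell with every other cell; pairs of observations in
    # the same row or column are ties and are ignored by gamma.
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_rows):
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if k > i and m > j:      # both variables higher: concordant
                        concordant += pairs
                    elif k > i and m < j:    # one higher, one lower: discordant
                        discordant += pairs
    return (concordant - discordant) / (concordant + discordant)

print(round(gamma(table), 2))  # 0.88, matching 21,400 / 24,400
```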
Bivariate Relationships and Hypothesis Testing
2. Determine the appropriate test statistic (based on the measurement levels of X and Y).
3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.
4. Calculate the test statistic from the sample data and determine the probability of observing a test statistic this large (in absolute terms) if the null hypothesis is true.
   P-value (significance level) – the probability of observing a test statistic at least as large as our observed test statistic if, in fact, the null hypothesis is true.
5. Choose an "alpha level" – a decision rule to guide us in determining which values of the p-value lead us to reject or not reject the null hypothesis.
   When the p-value is extremely small, we reject the null hypothesis (why?). The relationship is deemed "statistically significant."
   When the p-value is not small, we do not reject the null hypothesis (why?). The relationship is deemed "statistically insignificant."
   The most common alpha level is .05.

Bottom Line
Assuming we always use an alpha level of .05:
Reject the null hypothesis if the p-value < .05.
Do not reject the null hypothesis if the p-value > .05.

An Example
Dependent variable: vote choice in 2000 (Gore, Bush, Nader)
Independent variable: ideology (liberal, moderate, conservative)
1. Determine the null and alternative hypotheses.
   Null hypothesis: There is no relationship between ideology and vote choice in 2000.
   Alternative (research) hypothesis: There is a relationship between ideology and vote choice (liberals were more likely to vote for Gore, while conservatives were more likely to vote for Bush).
2. Determine the appropriate test statistic (based on the measurement levels of X and Y).
3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true. The sampling distribution for the chi-square statistic (under the assumption of perfect independence) depends on the degrees of freedom: df = (rows − 1)(columns − 1).
4. Calculate the test statistic from the sample data and determine the p-value.
5. Choose an alpha level and reject the null hypothesis if the p-value falls below it.

In-Class Exercise
For some years now, political commentators have cited the importance of a "gender gap" in explaining election outcomes. What is the source of the gender gap? Develop a simple theory and corresponding hypothesis (with gender as the independent variable) that seeks to explain the source of the gender gap. Specifically, determine:
1.) Theory
2.) Null and research hypothesis
3.) Test statistic for a cross-tabulation to test your hypothesis
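As a starting point for the exercise, the test machinery might look like the following sketch; the gender-by-vote counts are entirely made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical gender-by-vote-choice crosstab (counts are invented).
observed = np.array([
    [220, 180],   # men:   Republican, Democrat
    [170, 230],   # women: Republican, Democrat
])

# Steps 2-4: chi-square statistic, degrees of freedom, and p-value.
chi2, p_value, dof, _ = chi2_contingency(observed)

# Step 5: apply the .05 decision rule.
alpha = 0.05
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null: the relationship is statistically significant.")
else:
    print("Do not reject the null.")
```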