Quantitative Data Analysis

Quantification of Data
1. Introduction
• To conduct quantitative analysis, responses to open-ended questions in survey research and the raw data collected using qualitative methods must be coded numerically.
• Most responses to survey research questions are already recorded in numerical format.
• In mailed and face-to-face surveys, responses are keypunched into a data file.
• In telephone and internet surveys, responses are automatically recorded in numerical format.

2. Developing Code Categories
• Coding qualitative data can use an existing scheme or one developed by examining the data.
• Coding qualitative data into numerical categories sometimes can be a straightforward process. Coding occupation, for example, can rely upon numerical categories defined by the Bureau of the Census.
• Coding most forms of qualitative data, however, requires much effort, typically using an iterative procedure of trial and error.
• Consider, for example, coding responses to the question, “What is the biggest problem in attending college today?” The researcher must develop a set of codes that are:
• exhaustive of the full range of responses.
• mutually exclusive (mostly) of one another.
• The researcher might begin, for example, with a list of 5 categories, then realize that 8 would be better, then realize that it would be better to combine categories 1 and 5 into a single category and use a total of 7 categories.
• Each time the researcher changes the coding scheme, the coding process must be restarted so that all responses are coded using the same scheme.
• Suppose one wanted to code more complex qualitative data (e.g., videotape of an interaction between husband and wife) into numerical categories. How does one code the many statements, facial expressions, and body language inherent in such an interaction? One can realize from this example that coding schemes can become highly complex.
• Complex coding schemes can take many attempts to develop. Once developed, they undergo continuing evaluation, although major revisions are unlikely. Rather, new coders are required to learn the existing coding scheme and undergo continuing evaluation of their ability to correctly apply the scheme.

3. Codebook Construction
• The end product of developing a coding scheme is the codebook. This document describes in detail the procedures for transforming qualitative data into numerical responses.
• The codebook should include notes that describe the process used to create codes, detailed descriptions of codes, and guidelines to use when uncertainty exists about how to code responses. (A code sketch of applying such a codebook appears at the end of this section.)

4. Data Entry
• Data recorded in numerical format can be entered by keypunching or the use of sophisticated optical scanners.
• Typically, responses to internet and telephone surveys are entered directly into a numerical database.

5. Cleaning Data
• Logical errors in responses must be reconciled.
• Errors of entry must be corrected.
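To make the coding and cleaning steps concrete, here is a minimal sketch in Python of applying a simple codebook to open-ended responses. The categories, codes, and responses are invented for illustration; a real codebook would be developed iteratively, as described above.

    # A hypothetical codebook for "What is the biggest problem in
    # attending college today?" -- categories and codes are invented.
    codebook = {
        "cost": 1,      # tuition, fees, money problems
        "time": 2,      # balancing work, family, and classes
        "parking": 3,   # campus logistics
        "other": 9,     # residual category keeps the scheme exhaustive
    }

    responses = ["cost", "time", "cost", "advising", "parking"]

    # Code each response; anything outside the scheme falls into "other".
    coded = [codebook.get(r, codebook["other"]) for r in responses]
    print(coded)  # [1, 2, 1, 9, 3]

    # A simple cleaning check: flag any codes that are not in the codebook.
    valid = set(codebook.values())
    assert all(c in valid for c in coded)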
Quantification of Data
6. Collapsing Response Categories
• Sometimes the researcher might want to analyze a variable by using fewer response categories than were used to measure it. In these instances, the researcher might want to “collapse” one or more categories into a single category.
• The researcher might want to collapse categories to simplify the presentation of the results or because few observations exist within some categories.
• Example (a code sketch of this collapsing step appears at the end of this section, after the discussion of measures of association):

    Response                        Frequency
    Strongly disagree                    2
    Disagree                            22
    Neither agree nor disagree          45
    Agree                               31
    Strongly agree                       1

• One might want to collapse the extreme responses and work with just three categories:

    Response                        Frequency
    Disagree                            24
    Neither agree nor disagree          45
    Agree                               32

7. Handling “Don’t Knows”
• When asking about knowledge of factual information (“Does your teenager drink alcohol?”) or opinions on a topic the subject might not know much about (“Do school officials do enough to discourage teenagers from drinking alcohol?”), it is wise to include a “don’t know” category as a possible response.
• Analyzing “don’t know” responses, however, can be a difficult task.
• The research-on-research literature regarding this issue is complex and without clear-cut guidelines for decision-making. The decisions about whether to use “don’t know” response categories and how to code and analyze them tend to be idiosyncratic to the research and the researcher.

Quantitative Data Analysis
• Descriptive statistics summarize and describe the data collected in a study: the distributions of variables and the strength of relationships among them.
• Inferential statistics attempt to generalize the results of descriptive statistics to a larger population of interest.

1. Data Reduction
• The first step in quantitative data analysis is to calculate descriptive statistics about variables. The researcher calculates statistics such as the mean, median, mode, range, and standard deviation.
• Also, the researcher might choose to collapse response categories for variables.

2. Measures of Association
• Next, the researcher calculates measures of association: statistics that indicate the strength of a relationship between two variables.
• Measures of association rely upon the basic principle of proportionate reduction in error (PRE).
• PRE represents how much better one would be at guessing the outcome of a dependent variable by knowing a value of an independent variable.
• For example: How much better could I predict someone’s income if I knew how many years of formal education they have completed? If the answer to this question is “37% better,” then the PRE is 37%.
• Many statistics are designated by Greek letters. Different statistics indicate the strength of association between variables measured at different levels of data:
• Strength of association for nominal-level variables is indicated by λ (lambda).
• Strength of association for ordinal-level variables is indicated by γ (gamma).
• Strength of association for interval-level variables is indicated by the correlation coefficient (r).
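As promised in section 6 above, here is a minimal sketch in Python that reproduces the collapsed frequencies from the example. The counts are those shown in the table; only the mapping reflects the researcher's judgment.

    from collections import Counter

    # Observed frequencies from the five-category example.
    frequencies = {
        "Strongly disagree": 2,
        "Disagree": 22,
        "Neither agree nor disagree": 45,
        "Agree": 31,
        "Strongly agree": 1,
    }

    # Map each original category to its collapsed category.
    collapse = {
        "Strongly disagree": "Disagree",
        "Disagree": "Disagree",
        "Neither agree nor disagree": "Neither agree nor disagree",
        "Agree": "Agree",
        "Strongly agree": "Agree",
    }

    collapsed = Counter()
    for category, n in frequencies.items():
        collapsed[collapse[category]] += n

    print(dict(collapsed))
    # {'Disagree': 24, 'Neither agree nor disagree': 45, 'Agree': 32}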
Quantitative Data Analysis
2. Measures of Association (Continued)
• Covariance is the extent to which two variables “change with respect to one another.” As one variable increases, the other variable either increases (positive covariance) or decreases (negative covariance).
• Correlation is a standardized measure of covariance. Correlation ranges from -1 to +1, with values closer to -1 or +1 indicating a stronger relationship.
• Technically, covariance is the extent to which two variables co-vary about their means. If a person’s years of formal education is above the mean of education for all persons and his/her income is above the mean of income for all persons, then this data point would indicate positive covariance between education and income.

Statistics
1. Introduction
• To make inferences from descriptive statistics, one has to know the reliability of these statistics.
• In the same sense that the distribution of one variable has a standard deviation, a parameter estimate has a standard error: the standard deviation of the sampling distribution of the estimate with respect to the normal curve.
• To better understand the concepts standard deviation and standard error, and why these concepts are important to our course, please review the presentation regarding standard error.
• Presentation on Standard Error.

2. Types of Analysis
• The presentation on inferential statistics will cover univariate, bivariate, and multivariate analysis.
• Univariate analysis: mean, median, mode, standard deviation.
• Bivariate analysis: tests of statistical significance, chi-square.
• Multivariate analysis: ordinary least squares (OLS) regression, path analysis, time-series analysis, factor analysis, analysis of variance (ANOVA).

Univariate Analysis
1. Distributions
• Data analysis begins by examining distributions. One might begin, for example, by examining the distribution of responses to a question about formal education, where responses are recorded within six categories.
• A frequency distribution will show the number and percent of responses in each category of a variable.

2. Central Tendency
• A common measure of central tendency is the average, or mean, of the responses.
• The median is the value of the “middle” case when all responses are rank-ordered.
• The mode is the most common response.
• When data are highly skewed, meaning concentrated toward one end of the distribution, the median or mode might better represent the “most common” or “centered” response.
• Consider this distribution of respondent ages: 18, 19, 19, 19, 20, 20, 21, 22, 85.
• The mean equals 27. But this number does not adequately represent the “common” respondent because the one person who is 85 skews the distribution toward the high end.
• The median equals 20. This measure of central tendency gives a more accurate portrayal of the middle of the distribution.

3. Dispersion
• Dispersion refers to the way the values are distributed around some central value, typically the mean.
• The range is the distance separating the lowest and highest values (e.g., the range of the ages listed above equals 67, from 18 to 85).
• The standard deviation is an index of the amount of variability in a set of data.
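A quick check, in Python, of the central tendency and range figures from the ages example above.

    import statistics

    ages = [18, 19, 19, 19, 20, 20, 21, 22, 85]

    print(statistics.mean(ages))    # 27 -- pulled upward by the one 85-year-old
    print(statistics.median(ages))  # 20 -- a better "middle" for skewed data
    print(statistics.mode(ages))    # 19 -- the most common response
    print(max(ages) - min(ages))    # 67 -- the range (18 to 85)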
Univariate Analysis
3. Dispersion (Continued)
• The standard deviation represents dispersion with respect to the normal (bell-shaped) curve.
• Assuming a set of numbers is normally distributed, each standard deviation equals a certain distance from the mean.
• Each standard deviation (+1, +2, etc.) marks off the same distance along the bell-shaped curve, but each successive interval contains a declining percentage of responses because of the shape of the curve (see Chapter 7).
• For example, the first standard deviation accounts for 34.1% of the values on each side of the mean. The figure 34.1% is derived from probability theory and the shape of the curve. Thus, approximately 68% of all responses fall within one standard deviation of the mean.
• The second standard deviation accounts for the next 13.6% of the responses on each side of the mean (27.2% of all responses), and so on.
• If the responses are approximately normally distributed and the range of responses is low (meaning that most responses fall close to the mean), then the standard deviation will be small.
• The standard deviation of professional golfers’ scores on a golf course will be low. The standard deviation of amateur golfers’ scores on a golf course will be high.
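A minimal sketch in Python of these two points: the 68% figure follows from the normal curve, and tightly clustered scores produce a small standard deviation. The golf scores are invented for illustration.

    from statistics import NormalDist, stdev

    # Share of a normal distribution within one standard deviation of the mean.
    print(round(NormalDist().cdf(1) - NormalDist().cdf(-1), 3))  # 0.683

    # Invented scores: professionals cluster near par, amateurs vary widely.
    pro = [70, 71, 69, 70, 72, 70, 71, 69]
    amateur = [85, 102, 91, 110, 95, 88, 120, 99]
    print(round(stdev(pro), 1))      # small standard deviation (about 1)
    print(round(stdev(amateur), 1))  # large standard deviation (about 12)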
Univariate Analysis
4. Continuous and Discrete Variables
• Continuous variables have responses that form a steady progression (e.g., age, income).
• Discrete (i.e., categorical) variables have responses that are considered to be separate from one another (e.g., sex of respondent, religious affiliation).
• Sometimes it is a matter of debate within the community of scholars whether a measured variable is continuous or discrete. This issue is important because the statistical procedures appropriate for continuous-level data are more powerful, easier to use, and easier to interpret than those for discrete-level data, especially as related to the measurement of the dependent variable.
• Example: Suppose one measures amount of formal education within five categories (less than high school, high school, 2-year vocational/college, college, post-college). Is this measure continuous (i.e., 1-5) or discrete?
• In practice, five categories seems to be a cutoff point for considering a variable as continuous. Using a seven-point response scale will give the researcher a greater chance of deeming a variable to be continuous.

Bivariate Analysis
1. Introduction
• Bivariate analysis refers to an examination of the relationship between two variables. We might ask these questions about the relationship between two variables:
• Do they seem to vary in relation to one another? That is, as one variable increases in size, does the other variable increase or decrease in size?
• What is the strength of the relationship between the variables?
• To analyze a bivariate relationship in a table:
• Divide the cases into groups according to the attributes of the independent variable (e.g., men and women).
• Describe each subgroup in terms of attributes of the dependent variable (e.g., what percent of men approve of sexual equality and what percent of women approve of sexual equality).
• Read the table by comparing the independent variable subgroups with one another in terms of a given attribute of the dependent variable (e.g., compare the percentages of men and women who approve of sexual equality).
• Bivariate analysis gives an indication of how the dependent variable differs across levels or categories of an independent variable. This relationship does not necessarily indicate causality.
• Tables that compare responses to a dependent variable across levels/categories of an independent variable are called contingency tables (or sometimes “crosstabs”).
• When writing a research report, it is common practice, even when conducting highly sophisticated statistical analysis, to also present contingency tables, to give readers a sense of the distributions and bivariate relationships among variables.

2. Tests of Statistical Significance
• If one assumes a normal distribution, then one can examine parameters and their standard errors with respect to the normal curve to evaluate whether an observed parameter differs from zero by some set margin of error.
• Assume that the researcher sets the probability of a Type-1 error (i.e., the probability of concluding that a relationship exists when it does not) at 5%. That is, we set our margin of error very low, just 5%.
• To evaluate statistical significance, the researcher compares a parameter estimate to a “zero point” on a normal curve (its center). The question becomes: Is this parameter estimate sufficiently large, given its standard error, that, within a 5% probability of error, we can state that it is not equal to zero?
• To achieve a probability of error of 5%, the parameter estimate must be almost two (i.e., 1.96) standard deviations from zero, given its standard error. Sometimes in sociological research, scholars say “two standard deviations” in referring to a 5% error rate. Most of the time, they are more precise and state 1.96.
• Consider this example: Suppose the unstandardized estimate of the effect of self-esteem on marital satisfaction equals 3.50 (i.e., each additional unit of self-esteem on its scale results in 3.50 additional units of marital satisfaction on its scale). Suppose the standard error of this estimate equals 1.20.
• If we divide 3.50 by 1.20, we obtain the ratio 2.92. This figure is called a t-ratio (or t-value). The figure 2.92 means that the estimate 3.50 is 2.92 standard deviations from zero.
• Based upon our set margin of error of 5% (which is equivalent to 1.96 standard deviations), we can state that, at prob. < .05, the effect of self-esteem on marital satisfaction is statistically significant.
• The t-ratio is the ratio of a parameter estimate to its standard error. The t-ratio equals the number of standard deviations that an estimate lies from the “zero point” (i.e., center) of the normal curve.
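A minimal sketch in Python of the self-esteem example, using the normal curve (as these slides do) to convert the t-ratio into a two-sided probability of error.

    from statistics import NormalDist

    estimate = 3.50        # effect of self-esteem on marital satisfaction
    standard_error = 1.20

    t_ratio = estimate / standard_error
    print(round(t_ratio, 2))  # 2.92 -- standard deviations from zero

    # Two-sided probability of error: 2.5% in each tail would equal .05.
    p_value = 2 * (1 - NormalDist().cdf(t_ratio))
    print(round(p_value, 4))  # about 0.0035, well below .05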
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• Why do we state that we need to be 1.96 standard deviations from the zero point of the normal curve?
• Recall the area beneath the normal curve: the first standard deviation covers 34.1% of the observations on one side of the zero point, and the second standard deviation covers the next 13.6% of the observations.
• Let’s assume for a moment that our estimate is greater than the “real” effect of self-esteem on marital satisfaction. Then, at 1.96 standard deviations, we have covered the 50% probability below the “real” effect, and we have covered 34.1% + 13.4% probability above this effect (13.4% rather than the full 13.6%, because we stop at 1.96 rather than 2 standard deviations). In total, we have accounted for 97.5% of the probability that our estimate does not equal zero.
• That leaves 2.5% of the probability above the “real” estimate. But we have to recognize that our estimate might have fallen below the “real” estimate. So, we have the probability of error on both sides of “reality”: 2.5% + 2.5% equals 5%. This is our set margin of error!
• Thus, inferential statistics are calculated with respect to the properties of the normal curve. There are other types of distributions besides the normal curve, but the normal distribution is the one most often used in sociological analysis.
• If we know the properties of the normal curve, and we have calculated an estimate of a parameter, and we know the standard error of this estimate (i.e., the range of values within which the estimate might fall), then we can calculate statistical significance.
• Recall that statistical significance does not necessarily equal substantive significance.

3. Chi-Square
• Chi-square is a test of independence between two variables.
• Typically, one is interested in knowing whether an independent variable (x) “has some effect” on a dependent variable (y). Said another way, we want to know if y is independent of x (i.e., if it goes its own way regardless of what happens to x).
• Thus, we might ask, “Is church attendance independent of the sex of the respondent?”
• Scenario 1: Consider these data on sex of the subject and church attendance:

                 Church Attendance
    Sex        Yes     No     Total
    Male        28     12      40
    Female      42     18      60
    Total       70     30     100

• Note that 70% of all persons attend church, 70% of men attend church, and 70% of women attend church.
• Thus, we can say that church attendance is independent of the sex of the respondent because, if the total percentage of churchgoers equals 70%, then, with independence, we expect 70% of men and 70% of women to attend church, and they do.
• Scenario 2: Now, suppose we observed this pattern of church attendance:

                 Church Attendance
    Sex        Yes     No     Total
    Male        20     20      40
    Female      50     10      60
    Total       70     30     100

• Note that 70% of all persons attend church. Therefore, if church attendance is independent of the sex of the respondent, then we expect 70% of the men and 70% of the women to attend church. But they do not. Instead, 50% of the men attend church and 83.3% of the women attend church.
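A minimal sketch in Python of the expected counts under independence for the second table: each expected cell equals (row total × column total) / grand total.

    # Margin totals from the second church-attendance table.
    row_totals = {"Male": 40, "Female": 60}
    col_totals = {"Yes": 70, "No": 30}
    n = 100

    # Under independence, expected count = row total * column total / n.
    expected = {(sex, attend): row_totals[sex] * col_totals[attend] / n
                for sex in row_totals for attend in col_totals}
    print(expected)
    # {('Male', 'Yes'): 28.0, ('Male', 'No'): 12.0,
    #  ('Female', 'Yes'): 42.0, ('Female', 'No'): 18.0}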
Bivariate Analysis
3. Chi-Square (Continued)
• So, for this second set of data, is church attendance independent of the sex of the respondent?
• Let’s begin by calculating how much error we would make by assuming that men and women behave as expected. That is, for each cell of the table, we will calculate the difference between the observed and expected values (observed minus expected):

                 Church Attendance
    Sex        Yes               No
    Male       20 - 28 = -8      20 - 12 = 8
    Female     50 - 42 = 8       10 - 18 = -8

• Note that in each cell, if we assume independence, we make a mistake equal to 8 (sometimes positive and sometimes negative). If we add all of our mistakes, we obtain a sum of zero, which we know is not true. So, we will square each mistake to give every number a positive valence.
• How badly did we do in each cell? To know the magnitude of our mistake in each cell, we will divide the squared mistake by the expected value in the cell (a PRE measure). The following table shows our proportionate degree of error in each cell and our total amount of proportionate error for the entire table.
• Proportionate error is calculated for each cell:

                 Church Attendance
    Sex        Yes                   No
    Male       (-8)² / 28 = 2.29     (8)² / 12 = 5.33
    Female     (8)² / 42 = 1.52      (-8)² / 18 = 3.56

• The total of all proportionate error = 12.70. This is the chi-square value for this table.
• Our chi-square value of 12.70 summarizes our proportionate amount of mistakes for the whole table. Is this number big enough to indicate a lack of independence between church attendance and sex of the respondent?
• To make this assessment, we compare our observed chi-square with a standardized distribution of PRE measures: the chi-square distribution.
• The chi-square distribution looks like a lopsided version of the normal curve. To compare our observed chi-square with this distribution, we need some indication of where we should be on the distribution, as we did with standard errors on the normal curve.
• On the chi-square distribution, we are “allowed” a certain amount of error depending upon our degrees of freedom.
• To understand degrees of freedom, reconsider our table on observed church attendance: given the margin totals, once we fill in one cell with the correct number, all the other cells are given.
• A degree of freedom is the number of correct guesses one must make to reach the point where all the other cells are given. Our table has one degree of freedom. The more correct guesses one must make, the greater the degrees of freedom and the more proportionate error one is “allowed” within the chi-square distribution before claiming a lack of independence.
• The amount of chi-square we are allowed, at a probability of error set to 5%, for one degree of freedom, equals 3.841. Our chi-square exceeds this amount. Thus, we can claim a lack of independence between church attendance and sex of the subject at a probability of error of less than 5%.
• Are you wondering where the number 3.841 comes from? It is 1.96 squared. Remember 1.96? It is the number of standard deviations within the normal curve that indicates a 5% Type-I error rate.
• The t-ratios for the effects of independent variables in regression analysis (discussed below) each have one degree of freedom. So, we are working with the same principles we used for the normal curve, but with a different distribution: the chi-square distribution.
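A minimal sketch in Python that reproduces the chi-square calculation above.

    # Observed and expected counts for the second church-attendance table.
    observed = {("Male", "Yes"): 20, ("Male", "No"): 20,
                ("Female", "Yes"): 50, ("Female", "No"): 10}
    expected = {("Male", "Yes"): 28, ("Male", "No"): 12,
                ("Female", "Yes"): 42, ("Female", "No"): 18}

    # Sum each squared "mistake" divided by its expected value.
    chi_square = sum((observed[cell] - expected[cell]) ** 2 / expected[cell]
                     for cell in observed)
    print(round(chi_square, 2))  # 12.7 -- above the 3.841 cutoff for 1 df at p = .05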
Bivariate Analysis
4. Some Words of Caution
1. Recognize that statistical significance does not necessarily mean that one has substantive significance.
2. Statistical significance refers to mistakes made from sampling error only.
3. Tests of statistical significance depend upon assumptions about sampling and distributions of data, which are not always met in practice.

Multivariate Analysis
1. Regression Analysis
• Regression analysis is a procedure for estimating the outcome of a dependent variable based upon the value of an independent variable. Thus, for just two variables, regression analysis is equivalent to analysis using the covariance or correlation between the variables.
• Typically, regression analysis is used to simultaneously examine the effects of more than one independent variable on a dependent variable. One might want to know, for example, how well one can predict income by knowing the education, age, race, and sex of the respondent.
• The statistic used to summarize the total PRE of multiple variables is the correlation squared, or R-square. R-square represents the total variance explained in the dependent variable: “how well we did” in explaining the topic we wanted to explain.
• R-square ranges from 0 to +1; the larger the value of R-square, the greater the predictive ability of the independent variables. The predictive ability of each variable is indicated by the statistic β (beta).
• Consider this equation:

    y = α + β1x1 + β2x2 + β3x3 + β4x4 + ε

    where:
    y = the value of the dependent variable,
    α = the intercept, or “starting point,” of y,
    βi = the strength of the effect of xi on y,
    ε = the amount of error in the prediction of y.

• β is called a parameter estimate. It represents the amount of change in y for a one-unit change in x. For example, a beta of .42 would mean that for each one-unit change in x (e.g., education) we would expect to observe a .42 unit change in y.
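A minimal sketch in Python of reading this equation as a prediction rule; the intercept and beta are hypothetical numbers chosen to match the example.

    alpha = 1.0   # hypothetical intercept: the "starting point" of y
    beta = 0.42   # hypothetical effect of x on y

    def predict(x):
        """Predicted value of y for a given value of x."""
        return alpha + beta * x

    # A one-unit change in x yields a .42 unit change in y.
    print(round(predict(13) - predict(12), 2))  # 0.42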
Multivariate Analysis
1. Regression Analysis (Continued)
• For the example we discussed earlier, we can rewrite the equation as:

    Income = α + β1(education) + β2(age) + β3(race) + β4(sex) + ε

    where each of the betas (β) can be negative or positive, letting us know the direction and strength of the relationship between each independent variable and income.

• In standardized form, this equation is:

    Income = β*1(education) + β*2(age) + β*3(race) + β*4(sex) + ε

    where each of the standardized betas (β*) ranges in size from -1 to +1. Note that the intercept (α) is omitted because, in standardized form, it equals zero.

• Each of the beta terms in these equations represents the partial effect of the variable on the dependent variable, meaning the effect of the independent variable on y after controlling for the effects of all other variables on y.
• The partial effects of independent variables in explaining the variance in a dependent variable can be visualized by thinking about the contributions of each player on a basketball team to the overall team performance.
• Suppose the team wins, 65-60. The player at center is the leading scorer with 18 points. So, we might say that the center is the most important contributor to the win. “Not so fast,” says regression analysis. Regression analysis also wants to know the contributions of the other players on the team and how they helped the center.
• Suppose that the point guard had 10 assists, 8 of which went to the center. Eight times the point guard drove the lane and then passed the ball to the center for an easy layup, accounting for 16 of the 18 points scored by the center. To best understand the contributions of the center, we would calculate the contributions of the center while “controlling for” the contributions of the point guard.
• Similarly, regression analysis shows the contribution to R-square for each variable, while controlling for the contributions of the other variables. The contribution of each variable in explaining variance in the dependent variable is summarized as a partial beta coefficient.
• In summary, regression analysis provides two indications of our ability to explain how societies work: the R-square shows how much variance is explained in the dependent variable, and the standardized betas (parameter estimates) show the partial effects of the independent variables in explaining the dependent variable.
• [Diagram omitted: a regression of income (y) on education (x). The regression equation is shown as a blue line; the intercept (α) is located where the regression line meets the y-axis, and the slope of the line is the beta coefficient (β), which equals .42.]
• We would interpret the results of this regression in this manner: “A one-unit change in education will result in a .42 unit change in income.” We can translate this interpretation into the actual units of education and income as measured in our study, to state, for example, “Each additional year of education results in an additional $4,200 in annual income.”
• One should be cautious about interpreting the results of regression analysis: a high R-square value does not necessarily mean that the researcher can be confident of knowing cause and effect, and predictions regarding the dependent variable are valid only within the range of the independent variables used in the regression analysis.
• The preceding discussion has focused upon linear regression. Regression lines can also be curvilinear or some combination of straight and curved lines.
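A minimal sketch in Python (using numpy) of estimating a bivariate regression by ordinary least squares. The education and income data are simulated for illustration with a true slope of .42, so the estimated beta comes out near the figure used in the example; they are not from any real study.

    import numpy as np

    rng = np.random.default_rng(0)
    education = rng.uniform(8, 20, size=200)                      # years of schooling
    income = 1.0 + 0.42 * education + rng.normal(0, 1, size=200)  # simulated

    # Least-squares estimates of the intercept (alpha) and slope (beta).
    X = np.column_stack([np.ones_like(education), education])
    (alpha, beta), *_ = np.linalg.lstsq(X, income, rcond=None)
    print(round(beta, 2))  # close to the true slope of 0.42

    # R-square: the share of variance in income explained by education.
    predicted = alpha + beta * education
    ss_residual = ((income - predicted) ** 2).sum()
    ss_total = ((income - income.mean()) ** 2).sum()
    print(round(1 - ss_residual / ss_total, 2))  # total variance explained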
Multivariate Analysis
2. Path Analysis
• Path analysis is the simultaneous calculation of regression coefficients within a complex model of direct and indirect relationships.
• The elaboration model regarding the success of women-owned businesses, discussed earlier, is an example of path analysis.
• Path analysis is a very powerful tool for examining cause and effect within a complex theoretical model.

3. Time-Series Analysis
• Time-series analysis uses comparisons of statistics and/or parameter estimates across time to learn how changes in the independent variable(s) affect changes in the dependent variable(s).
• Time-series analysis, when the data are available, can be a powerful tool for gaining a stronger indication of cause and effect than one learns from a cross-sectional analysis.

4. Factor Analysis
• Factor analysis indicates the extent to which a set of variables measures the same underlying concept. This procedure assesses the extent to which variables are highly correlated with one another compared with other sets of variables.
• Consider this table of correlations (i.e., a “correlation matrix”):

          X1     X2     X3     X4     X5     X6
    X1   1.00    .52    .60    .21    .15    .09
    X2    .52   1.00    .59    .12    .13    .11
    X3    .60    .59   1.00    .08    .10    .10
    X4    .21    .12    .08   1.00    .72    .70
    X5    .15    .13    .10    .72   1.00    .73
    X6    .09    .11    .10    .70    .73   1.00

• Note that variables X1-X3 are moderately correlated with one another but have weak correlations with variables X4-X6. Similarly, variables X4-X6 are moderately correlated with one another but have weak correlations with variables X1-X3.
• The figures in this table indicate that variables X1-X3 “go together” and variables X4-X6 “go together.” Factor analysis would separate variables X1-X3 into “Factor 1” and variables X4-X6 into “Factor 2.”
• Suppose variables X1-X3 were designed by the researcher to measure self-esteem and variables X4-X6 were designed to measure marital satisfaction. The researcher could use the results of the factor analysis, including the statistics produced by it, to evaluate the construct validity of using X1-X3 to measure self-esteem and using X4-X6 to measure marital satisfaction.
• Thus, factor analysis can be a useful tool for confirming the validity of measures of latent variables.
• Factor analysis can also be used for exploring groupings of variables. Suppose a researcher has a list of 20 statements that measure different opinions about same-sex marriage. The researcher might wonder whether the 20 opinions reflect a smaller number of “basic” opinions.
• Factor analysis of responses to these statements might indicate, for example, that they can be reduced to three latent variables, related to religious beliefs, beliefs about civil rights, and beliefs about sexuality. Then, the researcher can create scales of the grouped variables to measure these three sets of beliefs and examine their relationships to support for same-sex marriage.
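A minimal sketch in Python of the logic behind the correlation matrix above: six observed items are simulated from two latent variables, and the resulting matrix shows the same two-block pattern. All numbers here are simulated for illustration, not the table's actual values.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    self_esteem = rng.normal(size=n)    # latent variable behind X1-X3
    satisfaction = rng.normal(size=n)   # latent variable behind X4-X6

    # Each observed item equals its latent variable plus measurement error.
    items = np.column_stack(
        [self_esteem + rng.normal(scale=0.8, size=n) for _ in range(3)]
        + [satisfaction + rng.normal(scale=0.8, size=n) for _ in range(3)]
    )

    # Items sharing a latent variable correlate strongly; others weakly.
    print(np.corrcoef(items, rowvar=False).round(2))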
Multivariate Analysis
5. Analysis of Variance
• Analysis of variance (ANOVA) examines whether the mean value of a variable for one group differs from that of another group. Is the mean income for males, for example, statistically different from the mean income for females?
• For examining mean differences across just one other variable, the researcher uses one-way ANOVA, which is equivalent to a t-test. For two or more other variables, the researcher uses two-way ANOVA. The researcher might be interested, for example, in knowing how mean incomes differ based upon sex of subject and level of education.
• The logic of a statistical test of a difference in means is identical to that of testing whether an estimate differs from zero, except that the comparison point is the mean of the other group rather than zero. Rather than using just the estimate and its standard error for a single group, the procedure uses the estimates and standard errors of two groups to assess statistical significance.
• Suppose we wanted to know if the mean height of male ISU students differs significantly from the mean height of female ISU students. Rather than comparing the mean height of male ISU students to a hypothetical zero point, we would compare it to the mean height of female ISU students, where this comparison takes place within the context of standard errors and the shape of the normal curve.
• Suppose we find in our sample of 100 female ISU students that their mean height equals 65 inches with a standard deviation of 1.5 inches. These figures indicate that most females (68.2%) are 63.5 to 66.5 inches in height. Suppose that a sample of 100 male ISU students shows a mean height of 70 inches with a standard deviation of 2.0 inches.
• With samples of 100, the standard errors of these means equal 1.5/√100 = .15 inches for females and 2.0/√100 = .20 inches for males.
• Let’s set our margin of error (the probability of a Type-1 error) at 5%, meaning that we are looking at 1.96 standard errors on the normal curve to indicate statistical significance.
• Here is our question: If we allow the mean of females to “grow” by 1.96 standard errors and the mean of males to “shrink” by 1.96 standard errors, will they reach one another?
• The answer is no, not even close. The t-ratio (the number of standard errors separating the two means) equals 20: the 5-inch difference in means divided by the standard error of the difference, √(.15² + .20²) = .25 inches.
• We can state that the difference in mean heights between ISU males and females is statistically significant at prob. < .05 (actually, at considerably less than that; but that was our test margin).

Summary of Data Analysis
• Sociologists have at their disposal a wide range of statistical techniques to help them understand relationships among their variables of interest.
• These techniques, when used properly, can help sociologists understand human societies for the purpose of improving human well-being.
• Students who want to become professional sociologists must learn statistics and the proper application of these statistics to data analysis.
• Enjoy!