EVALUATING SUBSTANTIVE CONCLUSIONS BASED ON INCOMPLETE DATA: A METHODOLOGICAL COMPARISON OF MISSING DATA TECHNIQUES USED IN A STUDY OF THE EFFECT OF RACE ON TEACHERS’ EVALUATIONS OF STUDENTS’ CLASSROOM BEHAVIOR

Heather Schwartz
B.A., California State University, Sacramento, 2006

THESIS

Submitted in partial satisfaction of the requirements for the degree of MASTER OF ARTS in SOCIOLOGY at CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2009

© 2009 Heather Schwartz
ALL RIGHTS RESERVED

A Thesis by Heather Schwartz

Approved by:

_____________________________________, Committee Chair
Randall MacIntosh, Ph.D.

_____________________________________, Second Reader
Manuel Barajas, Ph.D.

____________________________
Date

Student: Heather Schwartz

I certify that this student has met the requirements for format contained in the University format manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for the thesis.

_____________________________, Graduate Coordinator
Amy Liu, Ph.D.
Department of Sociology

_________________
Date

Abstract of EVALUATING SUBSTANTIVE CONCLUSIONS BASED ON INCOMPLETE DATA: A METHODOLOGICAL COMPARISON OF MISSING DATA TECHNIQUES USED IN A STUDY OF THE EFFECT OF RACE ON TEACHERS’ EVALUATIONS OF STUDENTS’ CLASSROOM BEHAVIOR

by Heather Schwartz

Statement of Problem

The problem of missing data in statistical analysis is one that the field of social research has failed to adequately address, despite its potential to significantly affect results and subsequent substantive conclusions. The purpose of this study is to evaluate the practical application of missing data techniques in reaching substantive sociological conclusions on the basis of statistical analyses with incomplete data sets. This study compares three different methods for handling incomplete data: multiple imputation, direct maximum likelihood, and listwise deletion.

Sources of Data

The comparisons are conducted via a reexamination of a multiple regression analysis of the ECLS-K 1998-99 data set by Downey and Pribesh (2004), who reported the results of their study on the effects of teacher and student race on teachers’ evaluations of students’ classroom behavior using multiple imputation to handle missing data.

Conclusions Reached

After comparing the three different methods for handling incomplete data, this study comes to the general conclusion that multiple imputation and direct maximum likelihood will produce equivalent results and arrive at the same substantive sociological conclusions. The current study also found that direct maximum likelihood shared more similarities with listwise deletion than with multiple imputation, which may be the result of differences in data handling between this author and Downey and Pribesh. In general, both direct maximum likelihood and listwise deletion produced increased significance levels, and therefore a greater number of statistically significant variables, when compared to the multiple imputation results. Still, all three methods produced essentially equivalent results. The importance of taking method choice and missing data into careful consideration prior to performing a statistical analysis and drawing subsequent substantive conclusions is also stressed.

_____________________________________, Committee Chair
Randall MacIntosh, Ph.D.
_____________________________________
Date

TABLE OF CONTENTS

List of Tables

Chapter
1. INTRODUCTION
   Statement of the Problem
   Methods
   Organization of Current Study
2. LITERATURE REVIEW
   Effects of Race on Teachers’ Evaluations of Students’ Behavior
   Missing Data
   Missing Data Mechanisms
   Handling Incomplete Data
   Traditional Methods
   Direct Maximum Likelihood (DML)
   Multiple Imputation (MI)
   Review of Missing Data Issues
3. METHODOLOGY
   Hypothesis
   Sample
   Dependent Measures
   Independent Measures
   Control Variables
   Evaluation of Missingness
   Analytical Plan
4. FINDINGS
   Methodological Comparisons
   Summary of Findings
5. DISCUSSION
   Discussion of Findings
   Evaluation and Critique of Study
   Impact on Future Research
   Conclusion
References
LIST OF TABLES

1. Descriptive Statistics from Downey and Pribesh’s Study
2. Percentage of Missing Data from ECLS-K Public Use Data Files
3. Race of Student by Race of Teacher from Downey and Pribesh’s Study
4. Descriptive Statistics for the Variables Used in the Listwise Deletion Analysis
5. Unstandardized Regression Coefficients for the Dependent Variable Externalizing Problem Behaviors (Model 1 and Model 2 only)
6. Unstandardized Regression Coefficients for the Dependent Variable Approaches to Learning (Model 1 and Model 2 only)
7. Unstandardized Regression Coefficients for the Dependent Variable Externalizing Problem Behaviors (Model 3 and Model 4 only)
8. Unstandardized Regression Coefficients for the Dependent Variable Approaches to Learning (Model 3 and Model 4 only)

Chapter 1
INTRODUCTION

Statement of the Problem

Empirical research relies heavily on the statistical analysis of quantitative data, collected by various means such as self-administered questionnaires, surveys, and interviews. Unfortunately, cases and even entire variables must often be left out of these statistical analyses due to missing or incomplete data. At times, the incompleteness of a data set is the undesirable yet expected result of the research design, and thus generally ignorable (Fowler 2002; Rubin 1976; Schafer and Graham 2002; Stumpf 1978). For example, the researcher may choose to administer different versions of a survey to various subsets of the sample in order to save time or money and still obtain responses to a large number of questions. Missingness may also be the unplanned outcome of the research design, such as when questions are confusing or not all applicable response choices are provided. Missing or incomplete data can also be the result of attrition, especially in longitudinal studies where respondents may die or records may be lost before all waves of data are collected. In social research, missing data is often the result of respondents forgetting or refusing to answer all questions. Respondents will often skip questions they do not understand or respond to only part of multiple-part questions. In addition, many respondents will refuse to provide sensitive or private information, especially in the context of face-to-face interviews. In these instances, the missingness of data should be examined for its relationship to other variables and cases and should not be ignored (Allison 2002; Byrne 2001a; Enders 2006; Fowler 2002; Rubin 1976; Rudas 2005; Schafer and Graham 2002; Schafer and Olsen 1998; Stumpf 1978). Common statistical analysis methods are based on the assumption that the data set is essentially complete. Using conventional analytical methods with an incomplete data set will result in biased, unreliable results with diminished statistical power, and the statistical program may not run correctly (Allison 2002; Collins, Schafer, and Kam 2001; Fowler 2002; Rubin 1976; Rubin 1987; Rudas 2005; Schafer and Graham 2002). However, most data sets are not complete, especially in social research and longitudinal studies.
Nevertheless, the results of statistical analyses using incomplete data sets are often taken as valid and reliable (Allison 2002; Enders 2006; Rubin 1976; Rubin 1987; Schafer and Graham 2002). Fellow researchers may realize that the results may be biased due to a high rate of missingness, yet statistics are often utilized by those who do not possess such expertise. Consumers of the results of statistical research are often social and political institutions, such as hospitals, schools, and governmental agencies, that draw on the results to determine significant issues such as funding and program efficacy (Regoeczi and Riedel 2003). Consequently, it is important that statistical analyses be performed using complete data whenever possible in order to maximize the reliability and validity of the results (Allison 2002; Schafer and Graham 2002; Schafer and Olsen 1998; Stumpf 1978). Thus, when a complete data set is not available or feasible, it is necessary for researchers to address the issue of missing data.

Even with the proliferation of methodological and theoretical literature on the problem of missing data since the 1970s, many data analysts still fail to adequately address the issue of missing data from a theoretical basis, often simply treating it as a nuisance that can be ignored (Allison 1999; Baydar 2004; Enders 2006; Regoeczi and Riedel 2003; Rubin 1976; Rudas 2005; Schafer and Graham 2002; Stumpf 1978; Wothke and Arbuckle 1996). Regularly, a high rate of missingness is acknowledged by the author as a limitation but not accounted for in the actual analysis (Schafer and Graham 2002). Additionally, though there are currently methods and software designed to effectively deal with missing data, many social researchers are either not aware of them or choose not to employ them (Allison 1999). Despite recent theoretical advances, “theory has not had much influence on practice in the treatment of missing data” (Wothke and Arbuckle 1996:2). When missing data methods are utilized, analysts continue to use outdated ad hoc editing methods to force data into a semblance of completeness and often make problematic assumptions regarding the mechanisms of missingness without any theoretical basis (Baydar 2004; Enders 2006; Rubin 1976; Schafer and Graham 2002). It has been said that in social research there is a general tolerance of imprecision (Espeland 1988). Effective methods have been developed; however, they are often avoided, being viewed as difficult to learn and not widely applicable (Allison 1999). As will be discussed in depth, opinions regarding the employment of modern methods vary widely. For example, multiple imputation is generally accepted as an effective and adaptable method; however, it is also time-consuming and awkward to perform. On the other hand, direct maximum likelihood is difficult to learn but often preferable to use, though it has fewer applications (Allison 2002; Collins, Schafer, and Kam 2001; Enders 2006; Schafer and Graham 2002). Most researchers simply use whatever statistical software package they are comfortable with, regardless of whether their data meet the assumptions required by that method and despite literature suggesting otherwise (McArdle 1994). The problem of missing data is complex, as it is both an issue for researchers, who must choose when and how to deal with missing data, and for their audience, who consume the results of analyses on the assumption that they are reliable, valid, and unbiased.
This problem only becomes more complex when one introduces the issues of secondary data analysis (Rubin 1987; Rubin 1996). The problem of missing data in statistical analysis is one that the social research field has failed to adequately address and accept (Baydar 2004; Espeland 1988; McArdle 1994; Rudas 2005; Yuan and Bentler 2000). The broad purpose of this study is to bridge the gap between the methodological and theoretical literature regarding missing data techniques and their practical applications in reaching substantive conclusions on the basis of statistical analyses with incomplete data sets.

Methods

In order to attain this goal, this study will compare three different methods for handling incomplete data. This comparison will be conducted via a reexamination of a multiple regression analysis of the ECLS-K 1998-99 data set (Downey and Pribesh 2004). Downey and Pribesh (2004) reported the results of their study on the effects of teacher and student race on teachers’ evaluations of students’ classroom behavior using multiple imputation (MI). MI is a method that replaces each missing value with estimated values, the nature of which depends on the specific type of MI employed. According to Downey and Pribesh (2004), they followed Allison’s (2002) recommended data augmentation model and performed five imputations (Allison 2002; Schafer 1999). This study will compare these results with those obtained via two other generally accepted methods, direct maximum likelihood (DML) and listwise deletion (LD). The simplest and most commonly used method for dealing with missing data is deletion. In short, LD is performed by omitting from the analysis altogether any case that has missing data on any variable. There are other deletion methods, which will be discussed; however, LD will be the focus of this study (Allison 1999; Allison 2002; Byrne 2001a; Collins, Schafer, and Kam 2001; Enders 2006; Schafer and Graham 2002; Stumpf 1978). DML is a frequently recommended method and is considered comparable to MI. Simply put, DML identifies the parameter values that maximize the likelihood that the actual observed values would be produced by the estimated model. There are several varieties of maximum likelihood estimation, some of which will be reviewed, though direct maximum likelihood estimation will be utilized in the actual comparative analysis (Allison 2002; Eliason 1993; Enders 2006; Schafer and Graham 2002). In the course of performing this methodological comparison, the statistical software packages most commonly used to perform each of the three methods of analysis will also be discussed, as will the various effects of method choice. The aim of this comparison is to demonstrate how missing data methods influence substantive conclusions based on statistical analyses.

Organization of Current Study

Prior to performing these analyses, the theoretical and practical literature regarding missing data and methods for handling incomplete data sets will be reviewed at length, with a focus on the three methods to be compared. The substantive literature on the effect of student and teacher race on teachers’ evaluations of students’ classroom behavior will be introduced as well. The specific methods to be employed will be discussed at length and then applied to the ECLS-K 1998-99 data set used by Downey and Pribesh (2004). Two separate regression analyses will be performed, using the DML and LD methods.
The results of Downey and Pribesh’s analysis will be utilized as representative of the MI method. Then, the results of all three regression analyses will be compared and discussed to evaluate the efficacy, implementation, and appropriate applications of the different methods. In addition to presenting these findings, the current study will be evaluated, and apparent limitations will be discussed as well as implications for future studies.

Chapter 2
LITERATURE REVIEW

An objective of this study is to provide a cohesive link between the use of missing data methods in substantive research and the methodological theories regarding the handling of missing data in the field of social science research. This will be accomplished through the re-examination of the data analysis performed by Downey and Pribesh (2004) in their study of the relationship between teacher and student race in teachers’ evaluations of students’ behaviors, using three different methods for handling missing data. Prior to performing and comparing the various methods designed to handle incomplete quantitative data, the existing literature on these methods will be reviewed. However, in order to develop an understanding of the broad premise and significance of this current study, the substantive literature will first be reviewed.

Effects of Race on Teachers’ Evaluations of Students’ Behavior

Downey and Pribesh’s 2004 study examined the effects of students’ and teachers’ race on teachers’ evaluations of students, particularly on teachers’ subjective evaluations of students’ classroom behavior. Data from the ECLS-K 1998-99 kindergarten data set and the NELS 8th grade data set were compared to investigate whether teachers’ poor evaluations of black students’ behavior are better explained by teacher bias or by Oppositional Culture Theory. Oppositional Culture Theory emphasizes minority group agency, arguing that minority groups harm themselves by developing a culture in opposition to the values of the dominant group, particularly formal schooling (Downey 2008; Farkas, Lleras, and Maczuga 2002; Ogbu 2003). According to Downey and Pribesh, in order to support Oppositional Culture Theory, evaluations of black students must get worse as students age and adopt an oppositional culture ideology. Operationally, this would be exhibited by poorer evaluations of black 8th grade students as compared to kindergarten students, showing that black students do indeed change from eager learners into oppositional, defiant students. Downey and Pribesh’s theory of teacher bias is that teachers give less favorable evaluations to students of different backgrounds, in this case students from a different racial/ethnic background (Downey and Pribesh 2004; Ehrenberg, Goldhaber, and Brewer 1995; Long and Henderson 1971; Espinosa and Laffey 2003; Alexander, Entwisle, and Thompson 1987). The teacher bias hypothesis would be supported if both kindergarten and 8th grade black students received inferior evaluations from white teachers. The premise behind this hypothesis is that if white teachers are biased against black students or in favor of white students, the bias would operate regardless of students’ actual behaviors, as kindergarten students do not yet hold negative ideas about school or authority. According to Downey and Pribesh, their findings “replicated the pattern that others have found: Black students are typically rated as poorer classroom citizens than white students” (2004:275).
The results of Downey and Pribesh’s (2004) regression analyses indicate that the effects of racial matching are comparable for both kindergarten and 8th grade students, with white teachers giving black students poorer behavioral evaluations than white students. In Downey and Pribesh’s data analysis, for all models the regression coefficients indicated that black students receive higher ratings on the Externalizing Problem Behaviors scale (indicating more problem behaviors) and lower ratings on the Approaches to Learning scale (indicating fewer favorable behaviors) than do white students. Downey and Pribesh state that these less favorable evaluations are a function of the student-teacher race effect and not simply the effect of student race. In Model 4, they replaced the variables for students’ and teachers’ race with variables measuring the student-teacher racial dyad and found that black students matched with white teachers receive significantly higher ratings for problem behaviors (b = .150, p < .001) and significantly lower ratings for favorable behaviors (b = -.127, p < .001) when compared to white students matched with white teachers. However, they did not find any significant differences when looking at black students matched with black teachers or white students matched with black teachers, further evidence that the poorer evaluations of black students are related to the teacher’s race. In addition, they report that black students receive more favorable behavioral evaluations when matched with black teachers than do white students matched with white teachers (b = -.063, p = .06) (2004:275). Downey and Pribesh interpret these results as evidence to support their hypothesis that black students receive poorer evaluations than white students as a function of teacher bias on the basis of student race. In summary, Downey and Pribesh (2004) report that these statistical results do not support the Oppositional Culture Theory explanation. On the other hand, the results do support the teacher bias explanation that white teachers are biased against black students. Based on this conclusion regarding subjective evaluations by teachers, Downey and Pribesh conclude that the next step would be to examine how student-teacher racial (mis)matching impacts objective measures such as achievement and learning.

In addition to variables measuring student and teacher race and student-teacher racial matching, Downey and Pribesh also included several control variables in their analysis. Taking these control variables into account decreased the effects of students’ race on negative evaluations, and several control variables were statistically significant according to their published tables. Unfortunately, they do not include a discussion of these variables in their study. For one, female students received far more favorable evaluations than did male students in the areas of behavior (b = -.261, p < .001) and approaches to learning (b = .285, p < .001), effects that were larger than those of race according to Models 3 and 4. Another variable which had a larger effect on classroom evaluations than race was the type of parent(s) living in the student’s household. Those students with both biological parents in the home received more positive behavioral evaluations (b = -.189, p < .001) and were reported to have more skills for learning (b = .147, p < .001) as compared to students from other family compositions.
Also, first-time kindergartners were reported to have far better scores on the Approaches to Learning scale (b = .211, p < .001) and fewer problem behaviors (b = -.147, p < .001) than those children who had been students before. Other variables which were statistically significant in the direction of more favorable evaluations were: socioeconomic status, student’s age, teacher’s educational level (for problem behaviors only), public school (for problem behaviors only), and the percentage of students who are black (for problem behaviors only). The only control variable found to be statistically significant in the direction of less favorable behavioral evaluations was the percentage of students eligible for free lunch in the school, although its effects appear to be minimal (b = .001, p < .01). Downey and Pribesh do not report any discussion or statistic with which to evaluate the predictive power of their models, so it is unknown whether the variation in teachers’ evaluations of students’ classroom behaviors is adequately explained by these models. Still, one can see by the change in the student race regression coefficients from Model 1 and Model 2 to Model 3 that the inclusion of these control variables does account for some of the variation that was attributed to student race in the first two models. When controlling for the independent variables included in Model 3, the effects of students’ race on teachers’ evaluations decreased dramatically (b = .151 and b = -.127, respectively). It should be noted that the change in variables to measure student and teacher race in Model 4 had little effect on the control variables for either dependent variable.

Downey and Pribesh’s 2004 study has also informed subsequent research. Two studies have cited their findings as empirical evidence against Oppositional Culture Theory and proof of the need for policy changes in the areas of education and socialization of minority children (Downey 2008; Gosa and Alexander 2007). A substantial number of studies have utilized the work as support for the theory of teacher bias against students of dissimilar backgrounds, especially against minority students. The proposals based on these substantive findings include hiring more women and minorities for positions of power, eliminating tracking systems, providing supplemental on-campus programs for minority youths, demanding and rewarding behavior favored by mainstream society, focusing on enhancing non-cognitive skills in schools, and even challenging the idea that education equalizes other social inequities (Bodovski and Farkas 2008; Cohen and Huffman 2007; Condron 2007; Downey, von Hippel, and Broh 2004; Entwisle, Alexander, and Olson 2005; Lleras 2008; Shernoff and Schmidt 2008; Stearns and Glennie 2006; Tach and Farkas 2006). Thus, the substantive conclusions based on Downey and Pribesh’s (2004) statistical analysis will continue to have an impact on future applied and theoretical sociological studies. As with most studies, Downey and Pribesh’s work testing Oppositional Culture Theory has been critically reviewed by their peers. Criticism has primarily been based upon the problematic assumptions relating to Oppositional Culture Theory in general, and their operationalization of it in this and prior studies (Ainsworth-Darnell and Downey 1998; Downey 2008; Farkas, Lleras, and Maczuga 2002; Tach and Farkas 2006; Takei and Shouse 2008). The current study will take a critical look at the methodological aspects of Downey and Pribesh’s work.
Their substantive conclusion is directly related to the results of their statistical data analyses; however, these analyses were performed using an incomplete data set. Previous studies have shown that there may be problematic implications to relying on statistical data analyses using incomplete quantitative data (Regoeczi and Riedel 2003; Rubin 1976). Therefore, it is important to examine whether Downey and Pribesh’s results were valid. Were the results of their study skewed by missing data, and did they handle the issue of missing data appropriately? The current study will examine these questions through a review of prior literature and a methodological comparison. As explained above, Downey and Pribesh (2004) used a multiple imputation method to deal with the missing and incomplete cases in their statistical analyses. This is but one method used to handle missing and incomplete data sets in quantitative studies. The literature regarding the problem of missing data in general will be reviewed below, followed by an extensive discussion regarding particular methods.

Missing Data

In the literature regarding the handling of missing data, the existence of missing data in social research is depicted as common and often unavoidable (Allison 2002; Arbuckle 2007; Carter 2006; Fowler 2002; Regoeczi and Riedel 2003; Rudas 2005; Stumpf 1978). Data can be missing for a variety of reasons, including oversight in data collection and recording, respondent drop-out in longitudinal studies, planned missingness in research design, and refusal, among others (Byrne 2001a; Schafer and Graham 2002; Sinharay, Stern, and Russell 2001). The literature generally discusses three types of missing data in surveys: unit non-response, which indicates no information for a subject or case; item non-response, which is no information for a particular variable for a case; and undercoverage, either due to attrition or sampling issues (Madow, Nisselson, and Olkin 1983; Rudas 2005; Schafer and Graham 2002). Alternatively, others divide the three missing data situations into those of: omission, which includes unit and item non-response; attrition; and planned missing data, based on research design (Graham, Hofer, and Piccinin 1994). One of the most common reasons for missing data is refusal to answer a particular question in an otherwise complete survey, usually the refusal to provide a piece of personal or controversial information. Sinharay, Stern, and Russell (2001) found that income information is the most common item of refusal (around 14%). Whatever the reason, missing data is seen as inevitable in social research, especially surveys (Carter 2006; Yuan and Bentler 2000).

If incomplete data sets and other types of missing data are commonplace in survey research, then why is missing data considered a problem? Most social scientists are interested in substantive issues and are not focused on methodology; thus most statistical analyses are performed using standard complete-data methods even if the data are incomplete (Rubin 1978; Rubin 1987; Rubin 1996; Rudas 2005; Schafer et al. 1996). This presents a problem, as standard statistical methods are designed for complete, “rectangular” data sets (Little and Rubin 2002:3) where rows represent units (cases) and columns represent variables (items). Missing variables and cases alter the rectangularity of a data set, resulting in a variety of missing data patterns.
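These patterns are straightforward to inspect in practice. The following is a minimal sketch, using a small hypothetical pandas DataFrame rather than the ECLS-K data, of how the distinct missingness patterns in a data set can be tabulated:

    import numpy as np
    import pandas as pd

    # Hypothetical survey extract; np.nan marks item non-response.
    df = pd.DataFrame({
        "income":    [52000, np.nan, 61000, np.nan, 48000],
        "education": [12, 16, np.nan, 14, 12],
        "age":       [34, 41, 29, np.nan, 57],
    })

    # One row per case, True where a value is missing; counting the
    # distinct rows of this indicator matrix gives the missingness patterns.
    patterns = df.isna()
    print(patterns.value_counts())

In a monotone pattern the rows of this indicator matrix can be arranged into a staircase; in a general pattern, as in this invented example, no such arrangement exists.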
Most incomplete data sets used in sociological research exhibit a general or multivariate pattern of missing data due to item or unit non-response, or a monotone pattern due to attrition of respondents in longitudinal studies (Little and Rubin 2002). However, most statistical analysis procedures cannot deal with a non-rectangular data set, and sociologists often resort to ad hoc editing of data to make it fit the parameters required by the method of analysis (Rudas 2005; Schafer and Graham 2002). Given that complete-data methods cannot handle non-rectangular data matrices, missing data can seriously bias results and conclusions drawn from empirical research. The extent and direction of this bias depend on the amount and pattern of the missing data. However, there are no clear guidelines as to what constitutes a problematic amount of missing data or how to proceed with an incomplete data set (Bose 2001; Byrne 2001a; Fowler 2002; Madow, Nisselson, and Olkin 1983; Regoeczi and Riedel 2003). There are three general concerns when dealing with missing data: loss of efficiency, complication in data handling and analysis, and bias from the difference between observed and unobserved data (Horton and Lipsitz 2001; Mackelprang 1970; Stumpf 1978). More specifically, the problems associated with performing statistical analyses on an incomplete data set can include biased, inefficient estimates of population characteristics, distorted distributions, and increased errors in hypothesis testing (Collins, Schafer, and Kam 2001; Madow, Nisselson, and Olkin 1983). Due to the reduced sample size available for analysis, parameter estimates lose efficiency and become biased because respondents are systematically different from non-respondents; the sample is no longer representative of the population (Arbuckle 2007; Rubin 1987). Analyses of incomplete data sets performed with complete-data methods not only yield biased and inaccurate results, but are often taken as valid statistical analyses upon which substantive conclusions are based. Many social scientists lack the methodological expertise to thoroughly evaluate the soundness of the statistical findings of a study (Rubin 1996; Schafer et al. 1996). Unfortunately, many public agencies and social scientists knowingly misuse statistics to further their causes or to prove the efficacy of a certain social program. Therefore, the issues of incomplete data should be considered in all studies involving statistical analysis (Byrne 2001a; Madow, Nisselson, and Olkin 1983; Regoeczi and Riedel 2003).

Surprisingly, this concept was largely ignored in the statistics literature until the early 1970s, when Donald Rubin began his seminal work on the subject. His work was considered groundbreaking and has since provided the framework behind modern methods and the terminology used to discuss them (Baydar 2004; Espeland 1988; Little and Rubin 2002; Mackelprang 1970; Rubin 1976). After Rubin brought the issue of missing data into the limelight, there was a relative proliferation of concern with non-response, as evidenced by the creation of committees and panels for dealing with incomplete data. The federal Office of Management and Budget even created a guideline stating that no survey would be approved that anticipates a response rate of less than 50% (Rubin 1978).
Unfortunately, despite the increase in theoretical literature on the subject and the availability of methods and software which efficiently and validly deal with missing data, most social researchers still view these methods as novel, and the subject remains largely ignored (McArdle 1994; Rubin 1996).

Missing Data Mechanisms

While the analysis of incomplete data is necessarily inferior to that of a complete data set, the efficiency of analysis procedures varies depending on the proportion of missing data and its distribution in the data set; this is especially true when data are systematically missing or when a large portion of the sample is missing. In most discussions regarding statistical analysis with missing data, one can find reference to missing data mechanisms. The three mechanisms have been termed “missing completely at random,” “missing at random,” and “missing not at random” (Allison 1987; Byrne 2001a; Collins, Schafer, and Kam 2001; Mackelprang 1970; Schafer and Graham 2002; Stumpf 1978). Many interpret these mechanisms as the cause or reason for the missingness. However, missing data mechanisms actually represent the distribution of missingness, not a causal relationship. Mechanisms reflect the possible relationships between the missingness and the values of the missing items themselves, as well as other measured variables, not what causes the data to be missing (Enders 2006; Little and Rubin 2002; Rubin 1976; Schafer and Graham 2002). Rubin (1976) stated that missing data was ignorable if it was missing at random and observed at random, an idea that has come to be referred to as missing completely at random in most discussions (Allison 1987). Rubin’s 1976 definitions of the three primary mechanisms of missing data have become the common terms for discussing this topic, and will be used in the current discussion.

Missing completely at random (MCAR)

When the missingness does not depend on the observed or missing data in the data set, the missing data is said to be missing completely at random (MCAR) (Little and Rubin 2002; Schafer and Graham 2002). In other words, MCAR holds when the probability of missing data on a particular variable is independent of, or unrelated to, the value of that variable or the values of any other variables in the data set, observed or unobserved (Byrne 2001a; Enders 2001; Regoeczi and Riedel 2003). Basically, the observed values are a random subsample of a hypothetically complete data set in which the non-respondents do not differ from the respondents (Enders 2001; Little and Rubin 2002). If data are MCAR, the missingness is considered ignorable, because the missing response is independent of all variables in the study and occurs by chance. In the case of MCAR, analysis remains unbiased (Sinharay, Stern, and Russell 2001). There is no relationship between the patterns of missing and observed data (McArdle 1994), and missingness is not related to any known or unknown variable relating to the data set (Horton and Lipsitz 2001). MCAR is a special case of MAR, which is discussed below (Enders 2006; Schafer and Graham 2002). MCAR is the most restrictive assumption (Byrne 2001a). The MCAR assumption can be tested for statistically but is unlikely to hold (Regoeczi and Riedel 2003) unless the missingness is by design (Little and Rubin 2002).
Missing at random (MAR)

Data are missing at random (MAR) if the probability that an observation is missing can depend on the values of observed data, but not on the values of the missing item itself (Allison 2002; Baydar 2004; Byrne 2001a; Enders 2001; Enders 2006; Little and Rubin 2002; Regoeczi and Riedel 2003; Schafer and Graham 2002). Basically, the occurrence of missing values may be at random, but their missingness can be linked to the observed values of other variables in the data set (Byrne 2001a; Horton and Lipsitz 2001; Sinharay, Stern, and Russell 2001). MAR is also considered ignorable missingness, as some relationship exists between the patterns of missing data and the missing scores, but the data are still observed at random (McArdle 1994; Rubin 1976). MAR is a less restrictive assumption than MCAR (Byrne 2001a), but still not totally realistic in the social sciences. MAR is usually the case with planned missingness and can be viewed as a good working assumption (Allison 2002; Arbuckle 2007; Regoeczi and Riedel 2003; Schafer and Graham 2002; Yuan and Bentler 2000). However, there is no statistical test for MAR, as there is no way to determine whether missing values depend on the variable itself or differ systematically without knowing the values of the missing data. Methodologists suggest that the only way to be sure that missing data are MAR is to actually go back and collect the missing data. Some do suggest that one practical way of evaluating whether data are MAR is that the missingness should be predictable from other variables. Nevertheless, most also agree that minor departures from MAR may not cause significant bias, so the assumption of MAR is usually a safe one and tests are not as necessary.

Missing not at random (MNAR)

Missingness is considered to be missing not at random (MNAR) if the missingness is related to the values of the missing items themselves as well as to the observed values. MNAR missingness is considered nonignorable because the missingness is related to the value that would have been observed (Allison 2002; Horton and Lipsitz 2001; Little and Rubin 2002; McArdle 1994; Schafer and Graham 2002; Sinharay, Stern, and Russell 2001). MNAR is the least restrictive and most plausible assumption in applied settings (Byrne 2001a). However, it is also the most problematic, as MNAR missingness can affect the generalizability of findings and introduce significant bias as a result of the systematic difference between cases with missing and observed variables (Arbuckle 2007; Byrne 2001a; Fowler 2002; Stumpf 1978). The treatment of missing data depends on the mechanism behind the missingness; however, little attention is paid to this issue in practice (Allison 2002; Regoeczi and Riedel 2003; Stumpf 1978). Theoretically, in most statistical analyses it is generally assumed that missing data is accidental and random, and can thus be ignored (Rubin 1976). However, statisticians should examine the process behind missing data and include this process in their model and choice of method (Graham, Hofer, and Piccinin 1994; Rubin 1976).

Handling Incomplete Data

Most literature on missing data methods actually indicates that the best method for dealing with the issues of incomplete data is to collect data as fully and accurately as possible and to handle missing data at the data collection phase (Allison 2002; Fowler 2002; Graham, Hofer, and Piccinin 1994; Madow, Nisselson, and Olkin 1983; Rudas 2005). However, this is not always possible or sufficient.
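Before turning to specific handling methods, the practical consequences of the three mechanisms just described can be made concrete with a short simulation. The sketch below is purely illustrative (the variable names and missingness rates are invented, not drawn from the ECLS-K data); it generates data under each mechanism and shows how the complete-case mean, the quantity listwise deletion would report, behaves:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000
    x = rng.normal(0, 1, n)            # fully observed covariate
    y = 0.5 * x + rng.normal(0, 1, n)  # variable subject to missingness

    # MCAR: missingness is unrelated to anything in the data set.
    mcar = rng.random(n) < 0.3
    # MAR: probability of missingness depends only on the observed x.
    mar = rng.random(n) < np.where(x > 0, 0.5, 0.1)
    # MNAR: probability of missingness depends on the value of y itself.
    mnar = rng.random(n) < np.where(y > 0, 0.5, 0.1)

    print("true mean of y:", round(y.mean(), 3))
    for label, miss in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
        # Complete-case (listwise) mean: only cases with y observed.
        print(label, "complete-case mean:", round(y[~miss].mean(), 3))

Under MCAR the complete-case mean stays close to the true mean; under MAR and MNAR it is biased downward here, because cases with larger x (and hence, on average, larger y), or with larger y directly, are dropped more often.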
The goal of any statistical procedure is to make valid, efficient inferences about the population (Rudas 2005; Schafer and Graham 2002). The general guidelines for handling incomplete data are not dissimilar: a method should allow standard complete-data methods to be used, be capable of yielding valid inferences, and expose the sensitivity of inferences to various plausible models for nonresponse (Rubin 1987). When he began his work on incomplete data, Rubin outlined three related objectives which must be met for properly handling non-response: adjust estimates for the fact that non-respondents differ systematically from respondents; expand standard errors to reflect the smaller sample size and the differences between respondents and non-respondents; and expose the sensitivity of estimates and standard errors to possible differences between respondents and non-respondents on unmeasured background variables (Rubin 1987). The overarching goal when dealing with incomplete data is to estimate predictors so that uncertainty due to missingness is taken into account (Allison 1987; Rudas 2005; Stumpf 1978). This avoids the serious bias which missing data can cause in complete-data analysis methods (Sinharay, Stern, and Russell 2001). Adhering to this goal should provide consistent, efficient parameter estimates and good estimation of standard errors, which will allow valid hypothesis testing and confidence intervals (Allison 1987). How to deal with missing data in a way which meets these goals is the question. In keeping with the majority of literature on missing data methods, traditional methods will be briefly reviewed before discussing in detail the three contemporary methods which are the focus of the current study.

Traditional Methods

Most traditional methods are ad hoc editing treatments for missing data and are not theoretically based. Their primary goal is to fix up the data so they can be analyzed by methods designed for complete data, to make the data fit back into the rectangular matrix (Allison 1999; Allison 2002; Stumpf 1978; Wothke and Arbuckle 1996). These methods generally work well in limited contexts where data are MCAR and only a small amount of missing data exists. Still, they are prone to estimation bias and are less powerful and less efficient than modern methods, even when the data are MCAR (Enders 2006). One type of method traditionally used is reweighting, in which complete cases are weighted so that their distribution more closely resembles that of the full sample or population (Fowler 2002; Schafer and Graham 2002). However, the most commonly used methods repair data before analysis by discarding records (deletion) or filling in values (imputation) (Enders 2006). Simple imputation methods fill in missing values with some estimated score, such as an average value (Schafer and Graham 2002). These imputation methods obtain the imputed score used to fill in for the missing value in various ways. However, all are considered fairly arbitrary, lacking in variation, and significantly biased (Byrne 2001a; Enders 2006; Rubin 1987). Imputation methods are more efficient than deletion, as there is no data loss, but they are more difficult to implement and can severely distort relationships (Schafer and Graham 2002). There are two primary deletion methods: pairwise and listwise. In pairwise deletion, only cases with missing values on the variables included in a particular computation are excluded from the analysis (Allison 1999; Arbuckle 2007; Byrne 2001a; Enders 2006; Schafer and Graham 2002).
As a result, the sample size may vary widely across variables, and the sampled cases will differ from one analysis to the next.

Listwise Deletion (LD)

The other principal deletion method is listwise deletion (LD), also called complete case analysis or casewise deletion (Allison 1999; Arbuckle 2007; Schafer and Graham 2002; Sinharay, Stern, and Russell 2001; Stumpf 1978). Although LD is based on the ad hoc ideas of other traditional methods, it is the most commonly applied method for handling incomplete data problems. In LD, cases which do not have complete records on all variables to be included in the analysis are omitted (Byrne 2001a; Carter 2006; Enders 2006; Little and Rubin 2002; Schafer and Graham 2002). LD can lead to a very small sample size and thus incorrect statistical results (Sinharay, Stern, and Russell 2001). However, unlike pairwise deletion, all analyses are calculated with the same set of cases, and the final sample includes only cases with complete records. Yet, analysis of these complete cases may be biased because they can be unrepresentative of the full population if the missingness is not MCAR (Byrne 2001a; Carter 2006; Rubin 1987; Schafer and Graham 2002). Opinion regarding LD varies in the literature. The general view is that LD assumes data are MCAR, but can still produce somewhat accurate results with MAR data and a small amount of missingness (Carter 2006; Enders 2006; Sinharay, Stern, and Russell 2001). However, others note that even with data that are MCAR, LD, though consistent, is still inefficient, as it often discards a large amount of data (Allison 1987; Arbuckle 2007; Carter 2006; Enders 2006; Little and Rubin 2002). There is general consensus that LD is inefficient when there is a substantial amount of missing data, as a large portion of the data is discarded from analysis (Allison 1999; Schafer and Graham 2002), resulting in a loss of information, a reduced sample size, and decreased statistical power overall (Byrne 2001a). Still, most do agree that LD has its place in contemporary missing data methodology, as it is the simplest method for achieving satisfactory results, and it remains the most widely used method (Schafer and Graham 2002).

Direct Maximum Likelihood Estimation (DML)

“Appropriate methods do not make something out of nothing but do make the most out of available data” (Graham, Hofer, and Piccinin 1994:14). Maximum likelihood estimation (ML) is a theoretically based, iterative method for dealing with incomplete data (Byrne 2001a; Eliason 1993; Enders 2006; Little and Rubin 2002). ML has two basic assumptions: that the missing data mechanism is ignorable (MCAR or MAR) and that the data fit a multivariate normal model (Allison 1987; Sinharay, Stern, and Russell 2001; Yuan and Bentler 2000). The basic goal of ML is to identify the population parameter values that are most likely to have produced a particular sample of data, using the variables that are observed for each case (Collins, Schafer, and Kam 2001; Enders 2006). Basically, ML borrows information from the observed variables and identifies the parameter values that maximize the likelihood that the actual observed values would be produced by the estimated model (Eliason 1993; Enders 2001). ML uses iterative algorithms to try out different values (Enders 2006).
The discrepancy between each case’s data and the estimates is quantified by the likelihood, which is similar to a probability, measuring how likely a particular score is to occur from a normal distribution with a particular set of parameter values. There are three common ML estimation algorithms for missing data (Allison 2002; Enders 2001; Yuan and Bentler 2000):

1) The Expectation Maximization (EM) Algorithm, a commonly used two-stage iterative method. The first step is to obtain estimates given the observed values, and the second step is to find the maximum likelihood parameters using the data set completed with the estimates obtained in the first step.

2) The Multiple Group Approach, in which the sample is divided into groups so that each subgroup has the same pattern of missing data, and a likelihood function is computed for each group and then maximized.

3) Direct Maximum Likelihood (DML), which is similar to the multiple group method except that the likelihood function is calculated at the individual rather than the group level.

DML is considered theoretically superior to other ML methods and will be the ML method of focus in this study (Byrne 2001b; Yuan and Bentler 2000). Direct Maximum Likelihood (DML) is also referred to as full information maximum likelihood and raw maximum likelihood (Navarro 2003). DML is a direct likelihood inference that results from “ratios of the likelihood function for various values of the parameter” (Rubin 1976:586). DML is considered to be a direct method because there is no attempt to restore the data matrix to rectangular form (Byrne 2001b). Model parameters and standard errors are estimated directly using all observed data, and no values are imputed (Enders 2006). In direct approaches (DML and the Multiple Group Approach), parameter estimates are obtained directly from the available raw data without a preliminary data preparation step (i.e., imputation), and complete-data analysis methods can be used (Enders 2001). This is unlike indirect approaches (the EM Algorithm), in which an additional data preparation phase is required and further analyses are needed to recover lost residual variability. DML uses all of the available information on the observed portions of variables to generate ML-based statistics (Carter 2006; Navarro 2003). The DML estimate can be obtained by maximizing the data likelihood, a function linking the observed and missing data to the model parameters. It maximizes the observed data likelihood to obtain the ML estimates of the parameters (Little and Rubin 2002; Sinharay, Stern, and Russell 2001; Yuan and Bentler 2000). DML maximizes the case-wise likelihood of the observed data through an iterative function, which ultimately converges on a single set of parameter values that maximize the log likelihood (Enders 2006; Wothke and Arbuckle 1996). Even among those who favor ML methods, DML has historically been unavailable and rarely used except in SEM applications (Allison 1999; Byrne 2001a; Collins, Schafer, and Kam 2001). The reason for this lack of use is mostly that DML is model-specific, complicated, and until recently required sophisticated methods (Sinharay, Stern, and Russell 2001). The general feeling was that DML may be too much work for practical use, and many prefer to use the EM algorithm over DML (Allison 1987; Yuan and Bentler 2000). Still, the advantages of DML over ad hoc methods are clear. DML’s lack of reliance on MCAR is one obvious advantage (Wothke and Arbuckle 1996).
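The case-wise likelihood idea can be sketched concretely. The following is a minimal illustration, not the Amos implementation used later in this study: it assumes a bivariate normal model for an invented pair (x, y) in which y is MAR given x, and it maximizes a log likelihood in which complete cases contribute the joint density while incomplete cases contribute only the marginal density of x.

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(1)
    n = 2000
    x = rng.normal(5, 2, n)
    y = 1.0 + 0.8 * x + rng.normal(0, 1, n)        # true mean of y is 5.0
    obs = rng.random(n) > stats.norm.cdf(x - 5)    # y is MAR: depends on x only

    def neg_loglik(theta):
        mx, my, log_sx, log_sy, arho = theta
        sx, sy = np.exp(log_sx), np.exp(log_sy)    # keep sds positive
        rho = np.tanh(arho)                        # keep |rho| < 1
        cov = [[sx**2, rho * sx * sy], [rho * sx * sy, sy**2]]
        # Complete cases contribute the joint (bivariate normal) density.
        ll = stats.multivariate_normal([mx, my], cov).logpdf(
            np.column_stack([x[obs], y[obs]])).sum()
        # Incomplete cases contribute the marginal density of x alone.
        ll += stats.norm(mx, sx).logpdf(x[~obs]).sum()
        return -ll

    start = [x.mean(), y[obs].mean(), np.log(x.std()), np.log(y[obs].std()), 0.0]
    fit = optimize.minimize(neg_loglik, start, method="Nelder-Mead")
    print("complete-case mean of y:", round(y[obs].mean(), 3))  # biased low
    print("DML estimate of mean(y):", round(fit.x[1], 3))       # near 5.0

Because the incomplete cases still inform the estimates of the mean and variance of x, and, through the estimated correlation, the mean of y, nothing is imputed and no case is discarded.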
DML provides unbiased and valid inferences under MAR, especially in large samples (Arbuckle 2007; Schafer and Graham 2002). Even when data are not quite MAR, DML produces valid results, and with MNAR data DML is the least biased method. DML is able to yield standard error estimates which take into account the incompleteness in the data and provide a method for hypothesis testing (Byrne 2001a; Little and Rubin 2002). A theoretical advantage of DML over other methods is that it provides estimates without requiring the filling in of missing values (Sinharay, Stern, and Russell 2001). DML is flexible, is not limited by the number of missing data patterns, and does not require complex steps to accommodate missing data (Carter 2006). DML is widely applicable to a variety of analyses, including multiple regression and structural equation modeling (SEM) (Enders 2001). New algorithms and software, such as Amos, have made DML a simpler and more feasible option (Allison 1987; Arbuckle 2007; Collins, Schafer, and Kam 2001; Zeiser 2008). DML is the standard option available for dealing with missing data in the Amos program. Amos (Analysis of Moment Structures) is a widely available statistical software package which uses SEM in the analysis of mean and covariance structures. Procedures can be performed by path diagram (Amos Graphics) or equation statement (Amos Basic) (Byrne 2001b; Byrne 2001a; Cunningham and Wang 2005). With the increased use of Amos, the DML method for the analysis of incomplete data sets will become more mainstream.

Multiple Imputation (MI)

Multiple Imputation (MI) is another type of correction method, similar in practice to the simple imputation methods described above, but theoretically based (Wothke and Arbuckle 1996). MI does not create information but represents the observed information so as to make it appropriate for valid analysis using complete-data tools (Rubin 1996). MI was designed for the survey context by Rubin in 1971, when he became concerned that non-respondents were systematically different from respondents in an educational testing survey (Rubin 1987). He developed MI for large sample surveys in which data are collected to be used by a variety of users and for a variety of purposes (Sinharay, Stern, and Russell 2001). In particular, MI was designed for instances where data base constructors and users are distinct entities (Rubin 1996). MI has three assumptions: matching imputation and analysis models, data that are at least MAR if not MCAR, and a multivariate normal model (Allison 2000; Enders 2006). MI is the process of replacing each missing data point with a set of m > 1 plausible values to generate m complete data sets to be analyzed by standard statistical methods (Allison 2000; Collins, Schafer, and Kam 2001; Enders 2001; Enders 2006; Freedman and Wolf 1995; Little and Rubin 2002; Navarro 2003; Penn 2007; Rubin 1987; Rubin 1996; Schafer 1999; Schafer and Graham 2002; Schafer and Olsen 1998; Sinharay, Stern, and Russell 2001; Yuan 2000). The multiple imputations represent a distribution of possibilities and reflect the uncertainty about nonresponse bias (Fay 1992; Graham, Hofer, and Piccinin 1994; Horton and Lipsitz 2001; Rubin 1978). MI is a three-step process:

1) Imputation: Starting from the observed data and using a predictive model-based method, a set of plausible values for the missing values is created. These values are used to fill in the missing data and create complete data sets. Multiple full data sets are created, each with a different set of random draws.

2) Analysis: The data sets are each analyzed using complete-data methods.
3) Combination: The results from each separate analysis are combined, using straightforward calculations developed by Rubin, into a single set of results.

There are two standard algorithms for performing MI: the propensity score classifier with approximate Bayesian bootstrap, and data augmentation (Allison 2000; Carter 2006; Schafer and Graham 2002). Generally, data augmentation is considered to be the best algorithm because it produces little or no bias. Data augmentation is an iterative, regression-based method of simulating the posterior distribution. It iteratively draws a sequence of values of the parameters and missing data until convergence occurs. The iterative chain repeats two steps (Allison 2000; Collins, Schafer, and Kam 2001; Enders 2001; Enders 2006; Schafer and Olsen 1998; Yuan 2000):

I Step: Missing values are replaced with predicted scores from a series of multiple regression equations.

P Step: New covariance matrix and mean vector elements are randomly sampled from a posterior distribution that is conditional on the filled-in data from the I Step.

The I and P steps are repeated numerous times, with imputed data sets saved at specified intervals, until convergence to their stationary distribution (Enders 2006; Yuan 2000). The data are said to converge when the within variation approximately equals the between variation (Little and Rubin 2002) or when the distribution of parameter estimates no longer changes between contiguous iterations (Enders 2001). The SAS program used by Downey and Pribesh (2004) uses a data augmentation MI method for imputing missing data based on Rubin’s (1987) guidelines, which assumes MCAR or MAR and multivariate normality (Horton and Lipsitz 2001; Sinharay, Stern, and Russell 2001; Yuan 2000). Overall, MI is considered to be a simple and generalizable method (Little and Rubin 2002; Sinharay, Stern, and Russell 2001). One of the major advantages of MI is that missing data are dealt with prior to analysis, which makes the completed data set more available for secondary data analysis. This also encourages a more inclusive strategy of adding auxiliary variables to improve the missing data management without having to actually include them in subsequent analyses (Collins, Schafer, and Kam 2001; Penn 2007; Schafer and Graham 2002; Schafer et al. 1996). The imputed data sets can be readily analyzed with available software, and no special analysis software is needed by the user (Enders 2006). This also allows different methods to be used for imputation and analysis (Schafer 1999). In addition, MI has weaker, more realistic assumptions than traditional methods and still performs well when the data do not fit the normal model (Allison 2000; Enders 2006). Because no data are omitted, MI sustains the original sample size and thus reduces bias relative to traditional methods.

Yet, MI has its drawbacks. The disadvantages of MI are that it takes more time, more effort, and more storage space than other methods, because multiple data sets must be created and analyzed (Graham, Hofer, and Piccinin 1994; Little and Rubin 2002; Rubin 1987). In addition to being labor intensive, MI also requires more statistical expertise (Enders 2006). Although the ability to perform analysis and imputation separately is an advantage, the results can be invalid if the analysis model is not the same as the imputation model (Fay 1992). However, this is easily remedied by the imputer, who should include as many variables as possible when doing imputations (Rubin 1996; Sinharay, Stern, and Russell 2001).
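The three steps, and the combination calculations in particular, can be sketched in a few lines. The example below is a simplified illustration, not the SAS data augmentation procedure Downey and Pribesh used: it imputes with stochastic regression draws around a fitted line (proper data augmentation would also redraw the regression parameters from their posterior at each P step) and then pools the m analyses with Rubin’s rules, where the total variance is T = W + (1 + 1/m)B.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 1000, 5                  # m = 5 imputations, as in Downey and Pribesh
    x = rng.normal(0, 1, n)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
    y[rng.random(n) < np.where(x > 0, 0.4, 0.1)] = np.nan   # MAR missingness
    obs = ~np.isnan(y)

    # Imputation model: regression of y on x among the complete cases.
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    resid_sd = (y[obs] - (intercept + slope * x[obs])).std()

    estimates, variances = [], []
    for _ in range(m):
        y_imp = y.copy()
        # Step 1 (imputation): predicted value plus a fresh random residual.
        y_imp[~obs] = (intercept + slope * x[~obs]
                       + rng.normal(0, resid_sd, (~obs).sum()))
        # Step 2 (analysis): here the complete-data analysis is just the mean.
        estimates.append(y_imp.mean())
        variances.append(y_imp.var(ddof=1) / n)  # its sampling variance

    # Step 3 (combination): Rubin's rules.
    q_bar = np.mean(estimates)                   # pooled point estimate
    W = np.mean(variances)                       # within-imputation variance
    B = np.var(estimates, ddof=1)                # between-imputation variance
    T = W + (1 + 1 / m) * B                      # total variance
    print(f"pooled mean = {q_bar:.3f}, standard error = {T**0.5:.3f}")

The between-imputation term B is what distinguishes MI from single imputation: it carries the extra uncertainty due to the missing values into the pooled standard error.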
There are two standard algorithms for performing MI: the propensity score classifier with approximate Bayesian bootstrap, and data augmentation (Allison 2000; Carter 2006; Schafer and Graham 2002). Generally, data augmentation is considered to be the best algorithm because it produces little or no bias. Data augmentation is an iterative, regression-based method of simulating the posterior distribution. It iteratively draws a sequence of values of the parameters and missing data until convergence occurs. The iterative chain repeats two steps (Allison 2000; Collins, Schafer, and Kam 2001; Enders 2001; Enders 2006; Schafer and Olson 1998; Yuan 2000):

I Step: Replaces missing values with predicted scores from a series of multiple regression equations.

P Step: New covariance matrix and mean vector elements are randomly sampled from a posterior distribution that is conditional on the filled-in data from the I Step.

The I and P steps are repeated numerous times, with imputed data sets saved at specified intervals, until convergence to their stationary distribution (Enders 2006; Yuan 2000). The data are said to converge when within variation approximately equals between variation (Little and Rubin 2002) or when the distribution of parameter estimates no longer changes between contiguous iterations (Enders 2001). The SAS program used by Downey and Pribesh (2004) uses a data augmentation MI method for imputing missing data based on Rubin's guidelines (1987), which assumes MCAR or MAR and multivariate normality (Horton and Lipsitz 2001; Sinharay, Stern, and Russell 2001; Yuan 2000).

Overall, MI is considered to be a simple and generalizable method (Little and Rubin 2002; Sinharay, Stern, and Russell 2001). One of the major advantages of MI is that missing data are dealt with prior to analysis, which makes the completed data set more available for secondary data analysis. This also encourages a more inclusive strategy of adding auxiliary variables to improve the missing data management without having to actually include them in subsequent analyses (Collins, Schafer, and Kam 2001; Penn 2007; Schafer and Graham 2002; Schafer et al. 1996). The imputed data sets can be readily analyzed with available software, and no special analysis software is needed by the user (Enders 2006). This separation also allows different models to be used for imputation and for analysis (Schafer 1999). In addition, MI has weaker, more realistic assumptions than traditional methods and still performs well when data do not fit the normal model (Allison 2000; Enders 2006). Because no data are omitted, MI sustains the original sample size and thus reduces bias relative to traditional methods.

Yet, MI has its drawbacks. The disadvantages of MI are that it takes more time, more effort, and more storage space than other methods, because multiple data sets must be created and analyzed (Graham, Hofer, and Piccinin 1994; Little and Rubin 2002; Rubin 1987). In addition to being labor intensive, MI also requires more statistical expertise (Enders 2006). Although the ability to perform analysis and imputation separately is an advantage, the results can be invalid if the analysis model is not the same as the imputation model (Fay 1992). However, this is easily remedied by the imputer, who should include as many variables as possible when doing imputations (Rubin 1996; Sinharay, Stern, and Russell 2001).

Review of Missing Data Issues

As the review of existing literature on the subject reveals, a wide variety of methods is currently in use for handling incomplete data sets, ranging from simply ignoring the missingness to elaborate estimation and analysis methods designed to simulate the variability of a complete data set. Since Rubin initiated the discussion in the 1970s, there has been a relatively large amount of theoretical literature regarding the issue of missing data in statistical analysis. Unfortunately, theory has not had much influence in practice on the treatment of missing data (Wothke and Arbuckle 1996). Most methodologists advise one to avoid solving missing data problems by simply replacing missing values with arbitrary numbers, such as 0 or -9, as is the common treatment (Schafer and Graham 2002). Further, recent theoretical literature urges the treatment of missing values as a source of variation rather than viewing it as simply a nuisance, as many traditional social scientists do (Sinharay, Stern, and Russell 2001).

Nevertheless, there is no real consensus about which methods should be employed to handle statistical analyses with missing data in practice (Collins, Schafer, and Kam 2001). Most methodologists do acknowledge that how missing data should be handled depends on the distribution of missing data in the particular data set to be analyzed (Regoeczi and Riedel 2003). Statisticians should consider the process behind missing data and include this process in their method choice (Graham, Hofer, and Piccinin 1994; Rubin 1976). In order for most statistical analysis methods to produce unbiased, valid results, the assumption of MCAR must be made, even for those which are purported to be appropriate for incomplete data sets. Unfortunately, little attention has been paid to the issue of missing data mechanisms and method choice in practice. In some cases, such as MCAR with a small amount of missingness, LD may be an appropriate method given its simplicity (Allison 2002; Enders 2001; Regoeczi and Riedel 2003). However, with MAR or a substantial amount of missing data, LD is clearly inefficient and biased (Wothke and Arbuckle 1996). The methodological literature presents MI and DML as generally comparable methods (Collins, Schafer, and Kam 2001; Enders 2006). Under general MAR conditions, a large portion of contemporary literature recommends DML as a first choice if at all possible, and then MI if appropriate, over traditional ad hoc and simple imputation methods (Navarro 2003; Schafer and Graham 2002). However, some methodologists suggest that the choice between methods may be one of personal preference and convenience rather than theoretically based (Allison 2002; Enders 2006).

Regardless of the rationale behind method choice, the fact remains that many social scientists perform statistical analyses using incomplete data sets. The results of such analyses are generally considered accurate and valid by both the researchers and their audience. Further, substantive conclusions and significant decisions are often based on the results of these statistical analyses. The remainder of the current study will focus on a comparison of three widely used, mainstream methods for handling incomplete data (LD, DML, and MI) and the subsequent evaluation of the substantive conclusions of Downey and Pribesh's 2004 study.
Chapter 3
METHODOLOGY

Hypothesis

In light of the prior literature, the following hypothesis has been developed and will be further explored in the methodological comparison to follow: DML and MI will produce equivalent results and in application arrive at the same substantive conclusions. Therefore, DML should be used whenever it is appropriate to do so, even if MI is also appropriate, as it is generally easier to implement.

Sample

The current study is a re-examination of Downey and Pribesh's 2004 study, "When Race Matters: Teachers' Evaluations of Students' Classroom Behavior." Downey and Pribesh utilized data from the Early Childhood Longitudinal Study's base year data collected from the fall kindergarten class of 1998-99, commonly referred to as the ECLS-K study (National Center for Education Statistics (NCES) 2004). In order to conduct a proper comparison of methods using Downey and Pribesh's 2004 study as a starting point, this study will utilize the same data set and variables as their original study.

The ECLS-K is a nationally representative sample of 22,782 children who attended kindergarten in the 1998-99 school year. Base year data were collected when the children were kindergarten students. There have been four subsequent waves of data collection since, when the children were in the first, third, and fifth grades (NCES 2004). Sampling for the primary wave of the ECLS-K involved a dual-frame, multistage sampling design (West, Denton, and Reaney 2001). The primary sampling units (PSUs) utilized by the ECLS-K study were revised from existing multipurpose PSUs created from 1990 county-level population data in order to meet the study's precision goals regarding size, race, and income. These PSUs were updated using "1994 population estimates of five-year-olds by race-ethnicity" (NCES 2004:4-1). These new PSUs were constructed for a minimum PSU size of 320 five-year-olds, a size which was designed to allow for oversampling of specific demographic groups. PSUs which did not meet the minimum standard were incorporated into an adjoining PSU. Next, private and public schools offering kindergarten programs were selected from within the sampled PSUs. The school sampling frame was comprised of pre-existing school data, with schools not meeting the minimum number of kindergarten students clustered together. Finally, students were sampled from these kindergarten programs, with the goals of obtaining a "self-weighting sample of students" and a minimum sample size for several targeted subpopulations (NCES 2004:4-8; West, Denton, and Reaney 2001). According to the ECLS-K study documentation:

The Early Childhood Longitudinal Study-Kindergarten Class of 1998-99 (ECLS-K) employed a multistage probability sample design to select a nationally representative sample of children attending kindergarten in 1998-99. The primary sampling units (PSUs) were geographic areas consisting of counties or groups of counties. The second stage units were schools within sampled PSUs. The third and final stage units were students within schools. (NCES 2004:4-1)

In all, 100 PSUs were selected for the ECLS-K. The 24 PSUs with the largest measures of size were designated as certainty selections or self-representing (SR) and were set aside. Once the SR PSUs were removed, the remaining PSUs were partitioned into 38 strata of roughly equal measure of size. The frame of non-SR PSUs was first sorted into eight superstrata by MSA/nonMSA status and by Census region.
Within the four MSA superstrata, the variables used for further stratification were race-ethnicity (high concentration of API, Black, or Hispanic), size class (MOS >= 13,000 and MOS < 13,000) and 1988 per capita income. Within the four non-MSA superstrata, the stratification variables were race, ethnicity and per capita income. (NCES 2004:4-2)

Once the sampled students were identified, school personnel provided contact information so that parents could be contacted for consent and to be interviewed themselves. In addition, each sampled student was linked to their primary teacher. Each case includes information on the student and parent, teacher and class, and school, obtained via parent interviews, self-administered teacher and school administrator questionnaires, and direct student assessments. This information is separated into three distinct data files: child file, teacher file, and school file (NCES 2004).

Following the methodology of Downey and Pribesh (2004), this study will focus on only the Fall 1998-99 data. Further, the focus will be restricted to "the 2,707 black and 10,282 white students who were matched with either a black teacher or a white teacher in the fall of 1998" (Downey and Pribesh 2004:270). This sample of 12,989 includes 1,024 black teachers and 11,965 white teachers.

Dependent Measures

The dependent variables used by Downey and Pribesh (2004) were selected in order to measure teachers' subjective assessments of students' classroom behaviors. The two ECLS-K variables used are scales constructed by the NCES. The actual questions used to measure these scales are not available for public use due to copyright issues (NCES 2004). The Externalizing Problem Behaviors scale asked teachers to rate how often the student argued, fought, got angry, acted impulsively, and disturbed ongoing activities. The Approaches to Learning scale asked teachers to rate the student's attentiveness, task persistence, eagerness to learn, learning independence, flexibility, and organization. Responses ranged from 1 to 4, with 1 = Never, 2 = Sometimes, 3 = Often, and 4 = Very Often. There was a fifth response option of N/O = No opportunity to observe this behavior, which was coded as -9. Non-response was coded as a missing value with no numerical identifier. Since the Externalizing Problem Behaviors scale measures the frequency of negative behaviors, a higher score is actually representative of a poorer evaluation. Conversely, the Approaches to Learning scale measures the frequency of positive behaviors, so a higher score is representative of a more favorable evaluation (Downey and Pribesh 2004; NCES 2004).

Independent Measures

The independent variables measure the student's and teacher's race. As indicated above, Downey and Pribesh (2004) only used black and white students matched with black and white teachers in their analysis. Therefore, the ECLS-K variable for student race must be modified so that only black and white students are included in the analysis. The original variable is coded as follows: 1 = white, 2 = black, 3 = Hispanic (race specified), 4 = Hispanic (race not specified), 5 = Asian, 6 = Native Hawaiian/Pacific Islander, 7 = American Indian/Alaskan Native, 8 = More than one race (not Hispanic), -1 = Not Applicable, and -9 = Not Ascertained (NCES 2004). This is a composite variable created by the NCES, and its components are not available for public use.
The ECLS-K variable for student race is recoded so that 1 = white, 2 = black, and all other racial categories are set to 0. Additionally, a filter will be applied so that only cases with values of 1 or 2 are included in the analysis (Downey and Pribesh 2004). The same sort of technique is applied to the ECLS-K variables for teacher race. However, the procedure is a little different, as teacher race is measured using five separate variables (Native American/Pacific Islander, Asian, black, Hawaiian, and white). The teacher is coded as either 1 = Yes, 2 = No, or -9 = Not Ascertained for each of the racial/ethnic categories, with non-responses coded simply as system missing with no numerical value (NCES 2004). Only the variables asking if the teacher is black or white are utilized in Downey and Pribesh's (2004) analysis. Further, only the 1 = Yes values are included in the analysis. Thus, dummy variables are created which set the 1 = Yes values to 1 and all other values to 0 for both the black and white teacher race variables. By doing this, only teachers who responded as being white or black are left with a numerical value. Finally, a single variable measuring teacher race was created by combining the black and white teacher race dummy variables. This new teacher race variable is coded as 1 = white and 2 = black, with all other values set to 0 (Downey and Pribesh 2004).

Several other independent variables were created by Downey and Pribesh (2004) to represent the (mis-)matching between student and teacher race. They created binary variables out of the teacher and student race variables to represent the four possible student-teacher race combinations (black student/black teacher, white student/white teacher, black student/white teacher, and white student/black teacher). The variables were produced by creating a dummy variable for each of the student and teacher racial categories separately, with 1 = the category to be measured (i.e., for the white student dummy variable, 1 = white and all others = 0). The four student-teacher race variables are then computed as the product of the appropriate student and teacher dummy race variables. Downey and Pribesh (2004) included these variables in their regression, with the white student/white teacher variable as the omitted referent category in Model 4.
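As a concrete illustration of the recoding just described, the following is a minimal pandas sketch. The file name and column names are hypothetical stand-ins, not the actual ECLS-K field names, and reading SPSS files with pd.read_spss requires the pyreadstat package.

```python
import pandas as pd

# Hypothetical file and column names standing in for the ECLS-K fields.
# convert_categoricals=False keeps the numeric codes rather than labels.
df = pd.read_spss("eclsk_base_year.sav", convert_categoricals=False)

# Student race: keep only white (1) and black (2) students,
# mirroring the filter described above.
df = df[df["student_race"].isin([1, 2])].copy()
df["white_student"] = (df["student_race"] == 1).astype(int)
df["black_student"] = (df["student_race"] == 2).astype(int)

# Teacher race: the yes/no items are coded 1 = Yes; all else becomes 0.
df["black_teacher"] = (df["t_black"] == 1).astype(int)
df["white_teacher"] = (df["t_white"] == 1).astype(int)

# Restrict to teachers identified as black or white.
df = df[(df["black_teacher"] == 1) | (df["white_teacher"] == 1)]

# The four student-teacher dyads are products of the dummies;
# white student/white teacher is the omitted referent in Model 4.
df["bs_bt"] = df["black_student"] * df["black_teacher"]
df["bs_wt"] = df["black_student"] * df["white_teacher"]
df["ws_bt"] = df["white_student"] * df["black_teacher"]
df["ws_wt"] = df["white_student"] * df["white_teacher"]
```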
Control Variables

Several other independent variables measuring important demographic and background information on the student, teacher, and school are included in Downey and Pribesh's (2004) analysis as control variables. These variables are:

Gender, which is a composite variable used to measure the student's gender based on the parent interview. The ECLS-K gender variable is coded as 1 = Male, 2 = Female, -9 = Not Ascertained, and non-response = no value. This variable is recoded to create a female dummy variable where 1 = Female and 0 = Male.

Socioeconomic status, a categorical variable which measures the student's family's socioeconomic status. This variable is coded as 1 = (bottom) 1st Quintile, 2 = 2nd Quintile, 3 = 3rd Quintile, 4 = 4th Quintile, 5 = (top) 5th Quintile, and non-response = no value. This variable is also a composite derived from several other variables measuring the parents' income (logged), education, and occupational prestige (NCES 2004).

Student age, which is measured using a ratio variable reporting student age in months. This variable is also a composite which was calculated "by determining the number of days between the child assessment date and the child's date of birth. The value was then divided by 30 to calculate the age in months" (NCES 2004:7-7). The actual values for each student are not available for public use.

Types of parents in household, a nominal variable which measures the type(s) of parent(s) who lived in the student's household at the time of the survey. This variable is coded as 1 = Biological mother and biological father, 2 = Biological mother and other father (step-, adoptive, foster), 3 = Biological father and other mother (step-, adoptive, foster), 4 = Biological mother only, 5 = Biological father only, 6 = Two adoptive parents, 7 = Single adoptive parent or adoptive parent and stepparent, 8 = Related guardian(s), 9 = Unrelated guardian(s), and non-response = no numerical value. A dummy variable is created for the category with both biological parents, with 1 = Biological mother and biological father and 0 = All other responses.

First-time kindergartner status, a variable which measures whether or not the student is a first-time kindergartner and is coded as 1 = Yes, 2 = No, -8 = Don't Know, -9 = Not Ascertained, and non-response = no value. A dummy variable is created for first-time kindergartners where 1 = Yes and 0 = No.

Teacher's educational attainment, a variable which measures the teacher's highest degree achieved. This variable is coded as 1 = High School/Associate's Degree/Bachelor's Degree, 2 = At least one year beyond Bachelor's, 3 = Master's Degree, 4 = Education Specialist/Professional Diploma, 5 = Doctorate, -9 = Not Ascertained, and non-response = no value.

Teacher age, which is a ratio variable measuring the teacher's age in years. This variable was recoded by the NCES from the teacher's response indicating their birth year in order to protect their privacy, and the actual response values are not available for public use.

Public school status, which is a categorical variable measuring whether the child's school is public or private, with 1 = Public, 2 = Private, -9 = Not Ascertained, and non-response = no value. Although Downey and Pribesh report in their Table 2 that the variable measures whether the school is identified as public with 1 = No and 0 = Yes, a closer examination reveals that the original variable actually asks if the school is public with 1 = Yes and 2 = No (2004:272). This variable is recoded to create a dummy variable for public school where 0 = No and 1 = Yes; thus, a higher value indicates that the school is public.

Percentage of students eligible for free lunch, which is a continuous ratio variable that measures the percentage of students in the school who were eligible for free lunch at the time of the survey. This is a composite variable created by the NCES from the school administrators' responses regarding the number of students enrolled in their school and the number of students who were eligible for free lunch.

Percentage of black students, which is a categorical variable measured at the school level to determine the percentage of students who are identified as black in that school at the time of the survey. Downey and Pribesh reported in their Table 2 (2004:272) that they were measuring the "percentage of minority students." However, their Table 5, which reported the results of their regression, states that the variable measured the "percentage of black students" (2004:276).
Upon further examination, it was determined that Downey and Pribesh were in fact looking at the percentage of black students. The ECLS-K variable used is an ordinal variable which is coded as 1 = 0%, 2 = more than 0% and less than 5%, 3 = 5% to less than 10%, 4 = 10% to less than 25%, 5 = 25% or more, -7 = Refused, -8 = Don't Know, -9 = Not Ascertained, and non-response = no value. There is also a discrepancy in how this variable is coded, as the ECLS-K codebook (NCES 2004) indicates the response ranges noted above, whereas Downey and Pribesh report in their Table 2 (2004:272) that the range is from 0 (none) to 5 (25% or more). It appears that the range noted in the ECLS-K codebook is correct. This variable is recoded to set all missing values (-7, -8, and -9) to system missing, thus leaving only the response codes 1-5 for use in the analysis. Descriptive statistics for all variables included in Downey and Pribesh's (2004:272) analysis can be found in Table 1.

Table 1: Descriptive Statistics from Downey and Pribesh's Study

Name of Variable                        Mean     Standard Deviation
Classroom Behavior
  Externalizing Problem Behaviors       1.65     .65
  Approaches to Learning                2.98     .68
Student-Teacher Race
  Black Student                         .21      .41
  Black Teacher                         .08      .27
  Black Student x Black Teacher         .06      .24
Student Characteristics
  Female Student                        .49      .50
  Socioeconomic Status                  3.26     1.36
  Student's Age (in months)             34.21    6.72
  Mother/Father Household               .66      .47
  First-time Kindergartner              .95      .21
Teacher Characteristics
  Teacher's Highest Degree              2.09     .91
  Teacher's Age                         41.72    9.91
School Characteristics
  Public School                         .76      .43
  Percentage Eligible for Free Lunch    26.76    26.25
  Percentage of Minority Students       1.39     2.92

Source: Downey and Pribesh 2004:272

Evaluation of Missingness

As discussed earlier, there are many different types of incomplete data and various reasons why data are missing, which are important to consider prior to performing any statistical data analysis. Because the ECLS-K public use data files have been modified, it is difficult to accurately evaluate the amount, types, patterns, and mechanisms of missing data in the ECLS-K data set (NCES 2001). These modifications include the inclusion of substitute schools to counteract poor school response rates, as well as composite variables and other mechanisms to protect the privacy of students and parents. Further, there are five different codes used for missing values, only one of which is readily recognized as missing by software such as SPSS and Amos, and several of which were not used in the construction of composite variables. Consequently, there is some disparity regarding the rates of missingness published by the NCES and by others who utilized the ECLS-K data set, depending on whether the restricted or public use data files were used and which missing data codes were counted as missing.

Based on data obtained using the ECLS-K base year public use data files and SPSS, there is a marked disparity between the percentage of cases counted as missing and those coded as missing under the various missing values codes. For example, for the variable measuring teachers' educational level, 15.4% of cases are coded as system missing and thus counted as missing by SPSS. However, another 5.3% of cases are coded as "not ascertained" and are given a numerical value of -9 (NCES 2001). Thus, one could either report a missingness rate of 15.4% or 20.7% for this variable and still technically be accurate.
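The two defensible missingness rates above (15.4% versus 20.7%) come down to whether sentinel codes such as -9 are counted as missing. A minimal pandas sketch of that accounting, with a hypothetical column name standing in for the ECLS-K teacher education field:

```python
import numpy as np

# Hypothetical column name; df is the data frame from the earlier sketch.
col = df["teacher_highest_degree"]

system_missing = col.isna().mean() * 100        # blanks SPSS already treats as missing
not_ascertained = (col == -9).mean() * 100      # sentinel code hiding among valid values

print(f"System missing only:    {system_missing:.1f}%")
print(f"Including -9 sentinels: {system_missing + not_ascertained:.1f}%")

# Before any analysis, recode the sentinel codes to NaN so that software
# counts them as missing rather than as legitimate response values.
# (In practice this is done per column, since codes differ across variables.)
col = col.replace({-7: np.nan, -8: np.nan, -9: np.nan})
```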
Due to these coding schemes, as well as the recoding of many variables by the NCES prior to the release of the ECLS-K public use data files, it is difficult to determine the rates and types of missingness present in the ECLS-K base year data. According to Downey and Pribesh, "missing cases were modest (i.e. less than 5 percent) for most variables. Percentage eligible for free lunches in ECLS is an exception with one third missing values in kindergarten" (2004:275). According to the NCES (2001:4-9), only 26.9% of public schools and 19.3% of all schools had missing values for percentage eligible for free lunch. While these numbers remain high, they are not the one third reported by Downey and Pribesh. However, based on the ECLS-K public use data files and SPSS' descriptive statistics function, 41.27% of schools have missing data on this variable. This difference may be explained by the fact that the NCES reported statistics on the original school sample of 1,277 schools, whereas it is assumed that Downey and Pribesh reported statistics based on the 866 schools used in the public use data files and may have only counted certain missing values, while the 41.27% is the sum of all types of missing data for that variable (NCES 2001). Still, while Downey and Pribesh reported modest rates of missing data, there are several variables used in this analysis with relatively high rates of missingness, as can be seen in Table 2.

Table 2: Percentage of Missing Data from ECLS-K Public Use Data Files (obtained via SPSS)

Name of Variable                        System Missing   "Not Ascertained"   "Don't Know"   Total
Classroom Behavior
  Externalizing Problem Behaviors       9.4              1.60                               11%
  Approaches to Learning                9.4              .37                                9.77%
Student-Teacher Race
  Student Race                          0                .33                                0.33%
  Black Teacher                         9.5              4.4                                13.9%
  White Teacher                         9.5              4.4                                13.9%
Student Characteristics
  Gender                                0                .06                                0.06%
  Socioeconomic Status                  5.3                                                 5.3%
  Age                                   6.9                                  3.45           10.35%
  Types of Parents in Household         14.9                                                14.9%
  First-time Kindergartner              14.9             .04                 .11            15.05%
Teacher Characteristics
  Teacher's Highest Degree              15.4             5.3                                20.7%
  Teacher's Age                         15.5             3.81                               19.31%
School Characteristics
  Public School                         14.6             .20                                14.8%
  Percentage Eligible for Free Lunch    14.6             26.67                              41.27%
  Percentage of Black Students          14.6             4.1                                18.7%

Incomplete data can introduce many problems into statistical analysis, including nonresponse bias. Consequently, the National Center for Education Statistics has set a standard for its surveys whereby any survey with an overall response rate of less than 70 percent is subject to a nonresponse bias analysis (Bose 2001). The purpose of the nonresponse bias analysis is to identify any potential sources of bias and address them, if possible. Nevertheless, unless the true population values are known, an accurate evaluation of bias is not possible. Further, even in cases where the real population values are known, there is no way to tell if bias is due to nonresponse or some other factor such as sampling bias. In order to fully understand the NCES' studies on nonresponse, one must first understand that the NCES uses two separate components to discuss nonresponse bias. First, the completion rate refers to the percentage of participating units and is calculated independently for the various components of a survey. Second is the response rate, which refers to the overall percentage of participation in the study, as determined by computing the product of the completion rates at the various stages (Bose 2001; West, Denton, and Reaney 2001).
The ECLS-K survey is multifaceted and includes data collected from various sources, including the school, the student, and the parents. In the base year (1998-99), students had a completion rate of 92% and parents had a completion rate of 89%. Unfortunately, only 944 of the 1,277 (74%) schools sampled participated in the first year of the study, and only 69.4% of schools participated in the fall wave. Therefore, when the student response rate was computed as a product of the student and school completion rates, it came to only 68.1%, and the parent response rate equaled only 65.9%. Thus, in keeping with NCES standards, a nonresponse bias analysis was conducted on the ECLS-K base year data (Bose 2001; West, Denton, and Reaney 2001). Five separate tests for potential nonresponse bias were conducted, and the findings indicated that there was no evidence of bias due to school nonresponse. Still, the NCES has admitted that recruiting schools to willingly participate in the first wave of data collection in the ECLS-K was harder than for any of the subsequent waves, and that this nonresponse and lack of cooperation may have resulted in biased estimates not identified in the analysis (Bose 2001). In fact, where school response rates were below 65%, substitute schools were recruited, with at least 74 substitute schools participating in the fall data collection (NCES 2001). Additionally, the NCES had to implement a "special refusal conversion effort" to recruit parents (NCES 2001:5-14) and began offering monetary incentives to teachers. Unfortunately, there is no way for the current study to perform its own nonresponse bias analysis of the ECLS-K base year data set, as the actual data in their original form with all values intact are not available for public use due to privacy issues.
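Concretely, the NCES response-rate arithmetic just described works out as follows for the base year (school completion 74%, student completion 92%, parent completion 89%):

\[
\text{response rate} = \prod_{k} \text{completion rate}_k: \qquad 0.74 \times 0.92 \approx 0.681, \qquad 0.74 \times 0.89 \approx 0.659,
\]

matching the 68.1% student and 65.9% parent response rates reported above. It is the low school completion rate that drags both figures below the NCES's 70 percent threshold.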
Analytical Plan

As indicated above, this study will examine three separate statistical analyses in order to compare three different methods for handling missing data, with the aim of evaluating the results of Downey and Pribesh's 2004 study of the student-teacher racial dyad. Each method will regress the two measures of teachers' evaluations of students' classroom behaviors on the students' and teachers' race and the interaction between students' and teachers' race, using the same four models which Downey and Pribesh utilized. The missing data methods to be used are: listwise deletion (LD), as it is the most commonly used and generally applicable traditional method; direct maximum likelihood (DML), which is a relatively easy to implement but more sophisticated method; and multiple imputation (MI), the method which Downey and Pribesh (2004) chose. For each of these analyses, the exact same variables will be utilized, and all measures will be taken to ensure these analyses remain comparable. The major difference will be the methodology used in dealing with the incomplete cases in the ECLS-K 1998-99 data set. In order to employ the missing data treatments, the variables may be further recoded to appropriately handle the incomplete, omitted, and other missing values, following each particular method's standard procedures. Each of these methods has been discussed thoroughly in the review of prior literature.

Listwise deletion

The first of these analyses will be an ordinary least squares regression analysis using the method of LD to deal with missing data (Allison 1999; Allison 2002; Byrne 2001b; Enders 2006; Rubin 1976). This analysis will be performed using the standard SPSS statistical software package to edit the ECLS-K base year public use data set and run a multiple regression analysis with missing values excluded using the listwise deletion method. Before running the regression, each variable was examined and recoded as necessary to specify all missing data values as "system missing". Only those values which represented meaningful categories were included in the analysis.

On page 270, Downey and Pribesh (2004) include a table showing the distribution by race of black and white students matched with black or white teachers. After applying a filter which restricted the analysis to only black and white kindergarten students with black or white teachers in the fall, the results obtained by performing this crosstabulation in SPSS were identical to those of Downey and Pribesh. These data can be found in Table 3.

Table 3: Race of Student by Race of Teacher from Downey and Pribesh's Study

                        Teacher's Race
Student's Race      White       Black       Total
White               10,090      192         10,282
Black               1,875       832         2,707
Total               11,965      1,024       12,989

Source: Downey and Pribesh 2004:270

Aside from using this table as an initial point of comparison between the two studies, this table also provides some useful substantive information. As Downey and Pribesh point out, almost 70% of the black students are matched with a teacher of a different race, as opposed to less than 2% of the white students. This puts the disproportionate number of white teachers into perspective for the reader.

Table 4 reports the descriptive statistics (mean and standard deviation) of the variables used in the multiple regression using LD after all recoding had been completed.

Table 4: Descriptive Statistics for the Variables Used in the Listwise Deletion Analysis

Name of Variable                        Mean     Standard Deviation
Classroom Behavior
  Externalizing Problem Behavior        1.63     .64
  Approaches to Learning                2.97     .68
Student-Teacher Race
  Black Student                         .19      .39
  Black Teacher                         .06      .24
  Black Student x Black Teacher         .04      .21
Student Characteristics
  Female Student                        .49      .50
  Socioeconomic Status                  3.15     1.40
  Student's Age (in months)             68.52    4.34
  Mother/Father Household               .66      .47
  First-time Kindergartner              .95      .21
Teacher Characteristics
  Highest Degree                        2.12     .90
  Teacher's Age                         41.80    10.06
School Characteristics
  Public School                         .77      .42
  Percentage Elig. for Free Lunch       28.18    27.09
  Percentage of Black Students          2.86     1.35

When compared to Table 1, which reports the mean and standard deviation values according to Downey and Pribesh's study, one can see that there is a slight difference in most of the values. For the most part, the mean and standard deviation values vary by less than +/- 0.10. Still, only three variables have matching mean and standard deviation values (female, mother/father household, and first-time kindergartner). Surprisingly, the mean value attained via SPSS for student age is 68.52, which is over double that reported by Downey and Pribesh (34.21). Upon closer examination, it appears that the value reported by Downey and Pribesh may be a typographical error. Since the variable measures student age in months, the mean age according to Downey and Pribesh would be 2.85 years, which is several years younger than the customary enrollment age for kindergarten. The mean age as determined via SPSS using the ECLS-K variable for student age would be 5.71 years, which appears to be a more reasonable value for the mean age of kindergartners.

Aside from the discrepancies in the student age variable, the only other difference which appears significant is that for the variable measuring the percentage of students in the school eligible for free lunch. The mean value from the SPSS analysis is 1.42 greater than that reported by Downey and Pribesh, with an increase in the standard deviation of 0.84. According to Downey and Pribesh (2004), the rate of missingness for this variable is the highest of all variables used, with one third missing. This may be one explanation for the difference in mean values, as missing values were excluded through listwise deletion in SPSS, which results in a smaller sample size and a possible increase in mean values.
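For concreteness, the following is a minimal Python sketch of what the listwise-deletion regression amounts to (the analysis itself was run in SPSS). The column names are hypothetical stand-ins for the recoded ECLS-K variables, reusing those from the recoding sketch above; "externalizing" stands in for the Externalizing Problem Behaviors scale.

```python
import statsmodels.api as sm

# Fully specified model (Model 3); hypothetical column names.
predictors = ["black_student", "black_teacher", "bs_bt", "female",
              "ses", "age_months", "both_bio_parents", "first_time_k",
              "t_degree", "t_age", "public_school",
              "pct_free_lunch", "pct_black_students"]

# Listwise deletion: any case missing the outcome or any predictor is
# dropped entirely, mirroring SPSS's "exclude cases listwise" option.
complete = df[["externalizing"] + predictors].dropna()

X = sm.add_constant(complete[predictors])
fit = sm.OLS(complete["externalizing"], X).fit()

print(fit.summary())                                 # coefficients, SEs, significance
print("n after listwise deletion:", int(fit.nobs))   # shows the reduced sample size
print("R squared:", round(fit.rsquared, 3))
```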
Direct maximum likelihood

The second analysis will be performed using the method of DML estimation to handle the missing data issues (Allison 2002; Arbuckle 2007; Byrne 2001b; Enders 2006). This analysis will use the Amos Graphics software package with the full information maximum likelihood function enabled to account for missing data by estimating means and intercepts (Arbuckle 2007). The Amos program will utilize structural equation modeling to estimate the regression equation from the available data, and the results from this analysis can be interpreted as if the analysis had been performed using a complete data set. Downey and Pribesh utilized adjusted standard errors; robust standard errors are available in Amos via its bootstrapping procedure, but that procedure requires complete data and therefore cannot be used in this study. Additionally, whereas Downey and Pribesh did not publish any measures of fit for their models, Amos calculates numerous model fit measures; the current study will use squared multiple correlation (R2) values, as this was also the model fit measure calculated using the LD method.

As indicated in the literature review, Amos allows the user to perform analyses using path diagrams and has a non-traditional user interface which may be difficult for social scientists to employ without specific training. Because of this and other program nuances, the literature suggests that all data editing and recoding should be completed prior to working with the data set in Amos. It is recommended that SPSS be used to perform this recoding, as Amos is able to use SPSS data files and recognizes the periods (.) in SPSS data sets as missing values (Arbuckle 2007; Zeiser 2008). Thus, prior to performing the DML analysis in Amos, all variables to be used were examined and recoded as necessary to specify missing data values and create dummy variables using SPSS.
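The full information maximum likelihood estimation that Amos performs can be summarized by its casewise log-likelihood, in which each case contributes only through the variables actually observed for it; a standard statement (e.g., Arbuckle 2007; Enders 2001) is:

\[
\log L(\mu, \Sigma) = \sum_{i=1}^{n} \left[ -\frac{1}{2} \log \lvert \Sigma_i \rvert - \frac{1}{2} (x_i - \mu_i)^{\top} \Sigma_i^{-1} (x_i - \mu_i) \right] + C,
\]

where $x_i$ is the vector of observed values for case $i$, and $\mu_i$ and $\Sigma_i$ are the subvector of the model-implied mean vector and the submatrix of the model-implied covariance matrix corresponding to those observed variables ($C$ is a constant). No values are filled in; incomplete cases simply contribute smaller terms to the likelihood, which is why DML requires no imputation step.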
Multiple Imputation

The third analysis has already been performed by Downey and Pribesh (2004), who used MI to treat the missing data and create complete data sets (Allison 2002; Byrne 2001b; Enders 2006; Schafer 1999; Schafer and Olson 1998). Downey and Pribesh (2004) used the SAS program to complete five imputed data sets, using all the variables included in their analysis, and to perform the subsequent multiple regression analysis. Additionally, Downey and Pribesh utilized adjusted standard errors. The results reported for the MI method will be taken directly from Downey and Pribesh's published results (2004).

The results obtained from these three analyses will be compared to one another. The goal of this comparison will be to evaluate the substantive results published by Downey and Pribesh (2004). This will be achieved by evaluating and comparing the statistical output and application of the various methods for handling missing data. This methodological comparison will test this study's claim that DML and MI will produce equivalent results and that DML may be generally recommended over MI, as it is easier to implement.

Chapter 4
FINDINGS

This study tested the hypothesis that DML and MI will produce equivalent results and in application arrive at the same substantive conclusions, and thus that DML should be used whenever appropriate, as it is generally easier to implement than MI. Some discrepancies were observed in this methodological comparison, as MI resulted in fewer statistically significant independent variables and reduced significance levels when compared to DML and LD. However, in general, the hypothesis can be supported, as MI and DML did produce equivalent results and arrive at the same substantive conclusions in the current methodological comparison.

In testing this study's hypothesis that DML and MI will produce equivalent results and thus will arrive at the same substantive conclusions in application, the results of an identical statistical analysis performed using both MI and DML will be evaluated and discussed. Additionally, the analysis was performed using the traditional LD method, and these results will also be discussed. This discussion will focus on the MI and DML methods, as these methods are generally supported on a theoretical basis in the literature. The basis for this methodological comparison is a study of the effect of race on teachers' evaluations of students' classroom behaviors performed by Downey and Pribesh (2004). Downey and Pribesh examined two dependent variables in four separate statistical models, with their substantive conclusion based primarily on the results of the fourth model. Therefore, while all models have been evaluated, the focus of this discussion will be on the fourth model. As will be discussed below, Tables 5, 6, 7, and 8 present the quantitative results of all three regression analyses.

When comparing the results from all three methods, a distinctive pattern emerges. While not as prominent in Models 1 and 2 as in Models 3 and 4, it is evident from all models that MI resulted in fewer statistically significant independent variables and reduced significance levels when contrasted with DML and LD. This means the standard errors for the DML and LD analyses are attenuated. In fact, DML generally resulted in a greater number of statistically significant independent variables than even LD, which was an unexpected outcome. This overarching trend can be most clearly recognized when looking at the teacher and school characteristic variables in Models 3 and 4 for both dependent variables. There one can see that DML produced increased statistical significance for a number of independent variables when compared with MI and LD, although LD also resulted in inflated numbers of statistically significant variables when evaluated against MI. Although this trend is prevalent in the teacher and school characteristic variables, there is only one variable which was found to be statistically significant by the MI method and not by DML (and at a reduced significance level using LD): students' socioeconomic status on the problem behaviors dependent variable. These trends and other methodological differences will be discussed below.
Table 5: Unstandardized Regression Coefficients for Dependent Variable Externalizing Problem Behaviors (Model 1 and Model 2 only)

                                     Model 1                           Model 2
Variable                      MI        DML       LD             MI        DML       LD
Student-Teacher Race
 Black Student              .222***   .225***   .237***        .237***   .241***   .253***
                            (.017)    (.015)    (.016)         (.018)    (.016)    (.016)
 Black Teacher              -.116**   -.132***  -.124***       .003      -.001     .013
                            (.029)    (.021)    (.024)         (.054)    (.046)    (.047)
 Black Student x
 Black Teacher                                                 -.162**   -.177**   -.184**
                                                               (.061)    (.055)    (.055)
Constant                    1.63      1.60      1.61           1.62      1.60      1.61
R Square                    a         .017      .018           a         .017      .019

*p<.05, **p<.01, ***p<.001 (two-tailed tests).
a Downey and Pribesh (2004) did not report an R Square statistic.

Table 6: Unstandardized Regression Coefficients for Dependent Variable Approaches to Learning (Model 1 and Model 2 only)

                                     Model 1                           Model 2
Variable                      MI        DML       LD             MI        DML       LD
Student-Teacher Race
 Black Student              -.240***  -.266***  -.270***       -.243***  -.274***  -.279***
                            (.017)    (.016)    (.016)         (.019)    (.016)    (.017)
 Black Teacher              .040      .086***   .056*          .008      .013      -.020
                            (.034)    (.022)    (.025)         (.058)    (.049)    (.049)
 Black Student x
 Black Teacher                                                 .042      .098      .100
                                                               (.065)    (.057)    (.057)
Constant                    3.01      3.03      3.01           3.01      3.03      3.01
R Square                    a         .022      .023           a         .022      .024

*p<.05, **p<.01, ***p<.001 (two-tailed tests).
a Downey and Pribesh (2004) did not report an R Square statistic.

Table 7: Unstandardized Regression Coefficients for Dependent Variable Externalizing Problem Behaviors (Model 3 and Model 4 only)

                                     Model 3                           Model 4
Variable                      MI        DML       LD             MI        DML       LD
Student-Teacher Race
 Black Student              .151***   .142***   .176***
                            (.020)    (.019)    (.027)
 Black Teacher              -.001     .029      -.020
                            (.053)    (.045)    (.069)
 Black Student x
 Black Teacher              -.208***  -.213***  -.182*
                            (.059)    (.053)    (.082)
 White Student-
 White Teacher                                                 ----      ----      ----
 Black Student-
 Black Teacher                                                 -.063     -.033     -.026
                                                               (.034)    (.028)    (.045)
 Black Student-
 White Teacher                                                 .150***   .150***   .176***
                                                               (.020)    (.020)    (.027)
 White Student-
 Black Teacher                                                 .006      .038      -.020
                                                               (.051)    (.046)    (.069)
Students' Characteristics
 Female Student             -.261***  -.264***  -.271***       -.263***  -.264***  -.271***
                            (.011)    (.009)    (.015)         (.011)    (.009)    (.015)
 Students' SES              -.018***  -.005     -.017**        -.017***  -.005     -.017**
                            (.005)    (.004)    (.006)         (.005)    (.004)    (.006)
 Students' Age              -.006***  -.005***  -.007***       -.006***  -.005***  -.007***
                            (.001)    (.001)    (.002)         (.001)    (.001)    (.002)
 Types of Parents
 in Household               -.189***  -.205***  -.189***       -.192***  -.205***  -.189***
                            (.014)    (.011)    (.017)         (.014)    (.011)    (.017)
 First-time
 Kindergartner              -.147***  -.127***  -.132***       -.146***  -.128***  -.132***
                            (.028)    (.023)    (.037)         (.029)    (.023)    (.037)
Teachers' Characteristics
 Highest Degree             -.020*    -.016**   -.035***       -.020*    -.017**   -.035***
                            (.009)    (.006)    (.009)         (.009)    (.006)    (.009)
 Teachers' Age              -.001     -.002***  -.002*         -.002*    -.002***  -.002*
                            (.000)    (.000)    (.001)         (.000)    (.000)    (.001)
School Characteristics
 Public School              -.043*    -.059***  -.055**        -.044*    -.062***  -.055**
                            (.021)    (.014)    (.021)         (.022)    (.014)    (.021)
 Percentage Eligible
 for Free Lunch             .001**    .001***   .001**         .001*     .001***   .001**
                            (.000)    (.000)    (.000)         (.000)    (.000)    (.000)
 Percentage of
 Black Students             -.011*    -.019***  -.017*         -.013*    -.019***  -.017*
                            (.007)    (.005)    (.007)         (.007)    (.005)    (.007)
Constant                    2.61      2.53      2.77           2.64      2.55      2.77
R Square                    a         .086      .095           a         .085      .095

*p<.05, **p<.01, ***p<.001 (two-tailed tests).
a Downey and Pribesh (2004) did not report an R Square statistic.
Table 8: Unstandardized Regression Coefficients for Dependent Variable Approaches to Learning (Model 3 and Model 4 only)

                                     Model 3                           Model 4
Variable                      MI        DML       LD             MI        DML       LD
Student-Teacher Race
 Black Student              -.127***  -.126***  -.171***
                            (.020)    (.020)    (.028)
 Black Teacher              .024      -.025     -.050
                            (.056)    (.046)    (.070)
 Black Student x
 Black Teacher              .082      .148**    .178*
                            (.062)    (.055)    (.083)
 White Student-
 White Teacher                                                 ----      ----      ----
 Black Student-
 Black Teacher                                                 -.022     -.006     -.042
                                                               (.040)    (.029)    (.046)
 Black Student-
 White Teacher                                                 -.127***  -.130***  -.171***
                                                               (.020)    (.020)    (.028)
 White Student-
 Black Teacher                                                 .022      -.028     -.050
                                                               (.056)    (.048)    (.070)
Students' Characteristics
 Female Student             .285***   .281***   .292***        .284***   .281***   .292***
                            (.011)    (.009)    (.015)         (.011)    (.009)    (.015)
 Students' SES              .088***   .074***   .085***        .087***   .075***   .085***
                            (.005)    (.004)    (.007)         (.005)    (.004)    (.007)
 Students' Age              .025***   .024***   .026***        .025***   .024***   .026***
                            (.001)    (.001)    (.002)         (.001)    (.001)    (.002)
 Types of Parents
 in Household               .147***   .175***   .156***        .146***   .176***   .156***
                            (.013)    (.011)    (.018)         (.013)    (.011)    (.018)
 First-time
 Kindergartner              .211***   .204***   .238***        .211***   .205***   .238***
                            (.030)    (.024)    (.038)         (.029)    (.024)    (.038)
Teachers' Characteristics
 Highest Degree             -.006     .000      .010           -.006     .000      .010
                            (.011)    (.006)    (.009)         (.003)    (.006)    (.009)
 Teachers' Age              -.002     -.002***  -.003**        -.002     -.002***  -.003**
                            (.001)    (.001)    (.001)         (.001)    (.001)    (.001)
School Characteristics
 Public School              .005      .036**    .034           .005      .039**    .034
                            (.023)    (.014)    (.021)         (.023)    (.014)    (.021)
 Percentage Eligible
 for Free Lunch             .000      -.001**   .000           .000      -.001**   .000
                            (.000)    (.000)    (.000)         (.000)    (.000)    (.000)
 Percentage of
 Black Students             .005      .018***   .016*          .007      .018***   .016*
                            (.008)    (.005)    (.007)         (.008)    (.005)    (.007)
Constant                    .56       .68       .54            .55       .66       .54
R Square                    a         .128      .134           a         .127      .134

*p<.05, **p<.01, ***p<.001 (two-tailed tests).
a Downey and Pribesh (2004) did not report an R Square statistic.

Methodological Comparisons

Multiple Imputation: A summary of findings from Downey and Pribesh's study

For the purposes of this study, the results published by Downey and Pribesh (2004) are utilized as an example of a multiple regression analysis using the MI method for handling incomplete and missing data, and they are the primary referent against which the other two methods are evaluated. The MI columns of Tables 5, 6, 7, and 8 present the results of Downey and Pribesh's analysis, which found that black students matched with white teachers receive significantly higher reports of problem behaviors and lower reports of favorable behaviors when compared to white students matched with white teachers (2004:276). They did not find any significant differences when looking at black students matched with black teachers or white students matched with black teachers.

As noted above, when compared to the DML and LD methods, the MI analysis resulted in decreased levels of statistical significance for a large number of variables. In Model 1, the black teacher variable had a smaller (in absolute value) regression coefficient and a decreased significance level using the MI method as opposed to the DML or LD methods. While there were no significant differences between MI and DML or LD in Model 2, this trend becomes more obvious in Models 3 and 4. In Models 3 and 4 with the problem behaviors dependent variable, the MI significance levels for all teacher and school characteristic variables are lower than they are using DML, and most are lower than with LD. This pattern continues with the learning approaches dependent variable, with the exception of the variable for teachers' education level.
However, this trend is not universal; in Model 3 for the dependent variable Externalizing Problem Behaviors, there is one variable which was found to be statistically significant using the MI method but not with DML: students' socioeconomic status. Whereas it was expected that MI would produce less inflated estimates when compared to LD, it was not expected that the estimates would be so unlike those obtained via DML.

Direct Maximum Likelihood

The results of the regression analysis using the DML method to handle missing data can be found in the DML columns of Tables 5, 6, 7, and 8. As indicated above, the findings from the DML analysis do vary somewhat from those of the analysis using MI and are fairly similar to those obtained via LD. While the regression coefficient and standard error values do differ slightly, the major differences between MI and DML are found in the coefficient significance values. Aside from a few variables, DML resulted in increased levels of statistical significance when compared to both the MI and LD methods. Again, this pattern is most clearly evident when looking at the school and teacher characteristic variables in Models 3 and 4, although it is also present in Model 1. Looking at Model 1 with Approaches to Learning as the dependent variable, one sees that the teacher race variable is highly statistically significant in the DML analysis (b = .086***), whereas it was not significant at all in the MI analysis; it was also significant in the LD analysis (b = .056*). For all models, the DML R2 values reveal that these models have somewhat weak predictive power, with the largest R2 value being .128. The R2 values for Model 1 indicate that only 1.7% of the variance in the Externalizing Problem Behaviors scale and 2.2% of the variance in the Approaches to Learning scale is explained by student's and teacher's race alone, values which are also very similar to those obtained via LD.

When the first student-teacher racial matching variable, black student-black teacher, was added in Model 2, there were actually no discrepancies found between the regression coefficient significance levels in the MI and DML analyses. This change in the variables for teacher and student race had no effect on the squared multiple correlation values, which remain at .017 and .024, respectively. The majority of the differences between the DML and MI methods arise in Models 3 and 4. In Model 3, the addition of independent variables as controls increases the discrepancies between the DML and MI methods. Most of these differences are found among the control variables' significance values. Many of the teacher and school characteristics which were not significant in the MI analysis were significant when the analysis was performed using DML; this was also the case in the LD analysis. For the dependent variable Externalizing Problem Behaviors, there is no discrepancy in the significance levels and very little discrepancy in the coefficient values for the three variables measuring student and teacher race. However, there is one variable which was highly significant in the MI analysis which is no longer statistically significant using DML: students' socioeconomic status. Conversely, there are four variables with increased significance values using DML versus MI on both dependent variables: teachers' education, public school, percentage eligible for free lunch, and percentage of black students.
Further, with the problem behaviors dependent variable, teachers' age was found to be not significant in the MI analysis but is highly statistically significant in the DML analysis, although its effects appear minimal (b = -.002***); this was also the case in the LD analysis. The differences between the MI and DML findings continue when Model 3 for the dependent variable Approaches to Learning is examined. In this analysis, the student-teacher racial matching variable is found to be a significant indicator, with black students matched with black teachers receiving better evaluations than other student-teacher racial combinations (b = .148**), although this variable was not found to be significant in the MI analysis.

With the substitution of the student-teacher racial matching variables for the student and teacher race variables in Model 4, there is not much change in the incongruity between methods found in Model 3. For the dependent variable Externalizing Problem Behaviors, again the student's socioeconomic status is not significant using DML, whereas it was with both the LD and MI methods. Also, there are five variables which have higher significance values than those obtained via the MI method: teacher's education, teacher's age, public school, percentage eligible for free lunch, and percentage of black students. All of the indicators for the Approaches to Learning variable that were significant in Model 3 remain so in Model 4. Thus, the four variables (teacher's age, public school, percentage eligible for free lunch, and percentage of black students) remain statistically significant in the DML analysis, although they were not found to be significant using the MI method. In this fully specified model, the squared multiple correlation values are .085 for Externalizing Problem Behaviors and .127 for Approaches to Learning.

Despite the literature which suggested that the DML and MI results would be similar, there were actually more similarities between the DML and LD methods and more inconsistencies between the DML and MI methods. In fact, the DML method found more variables to be significant and generally produced increased significance values even when compared to the LD method. The most glaring example of this is that in Models 3 and 4, both the DML and LD methods found the teacher and school characteristic variables to have increased statistical significance as indicators for both dependent variables when compared to MI. It is also important to note that only one variable was found to be significant by the MI method and not by the DML method: students' socioeconomic status. While the theoretical literature suggests that DML and MI should reach equivalent results and that LD will produce dramatically biased and inefficient results, this comparison unexpectedly revealed fewer similarities between DML and MI than between DML and LD.

Listwise Deletion

The LD columns of Tables 5, 6, 7, and 8 report the results of regressing the two measures of teachers' evaluations of students' classroom behavior on students' and teachers' race, using SPSS to perform the regression analysis with missing values excluded via LD. Following Downey and Pribesh's design, four separate models were used for each of the two dependent variables. While Downey and Pribesh did not publish any evaluations of their models' predictions, the current study tested the LD models' predictions using squared multiple correlation (R2) values.
The R2 values were statistically significant for all models using LD, and the fully specified models have R2 values of .095 and .134. Given the differences in the results of the MI and LD methods of handling incomplete and missing data with only one data set, one can clearly see how method choice can affect substantive conclusions. It is evident that the LD analysis has resulted in increased statistical significance for a number of variables as compared to MI. This may be due to the fact that there are generally far fewer cases being counted in LD versus the MI method. While Downey and Pribesh (2004) did not publish their sample size using MI, it is assumed that they utilized the entire sample of black and white students matched with black or white teachers (n = 12,989); whereas the fully specified models have a significantly reduced sample size using LD (n = 6,917 for Externalizing Problem Behaviors and n = 6,984 for Approaches to Learning). Substantive conclusions can vary considerably based simply on choosing a different method of dealing with missing data and the subsequent analysis. For example, if one were to perform this analysis using MI as Downey and Pribesh did, one would not consider teacher and school characteristics such as teacher's educational level, public versus private school, and percentage of students in the school eligible for free lunch to be significant, whereas someone performing the same analysis using LD might base their substantive conclusions on the inflated significance of these variables. While it is difficult to determine which method to choose, a brief review of the LD analysis will support the literature which advises against the use of LD on a theoretical basis.

As with DML, while the differences between MI and LD are most clearly seen in Models 3 and 4, they are present in Model 1 as well. In the first model, which looks at the effects of students' and teachers' race, the LD analysis resulted in increased statistical significance of the black teacher variable on both dependent variables when compared to MI. However, like the DML analysis, there are no significant differences between the LD and MI results for Model 2, which includes the student race and teacher race variables with the addition of a variable measuring the interaction of student-teacher race (black student x black teacher). With the addition of the black student-black teacher variable, the variable for teacher race is no longer statistically significant for either of the dependent variables (recall that Downey and Pribesh did not find it to be significant in the Approaches to Learning Model 1). In just examining these two models, it is already apparent that the LD analysis has resulted in increased significance levels when evaluated against the MI analysis. Yet, its similarities with DML are also emerging.
The discrepancies between the LD and MI methods increase with the addition of control variables in Models 3 and 4. While there is some minor variation in the values of the regression coefficients and standard errors, the primary differences are evident in the significance values. Again, this is similar to the pattern seen in the DML comparison. In looking at the Model 3 analysis with the Externalizing Problem Behaviors scale as the dependent variable, it is evident that there is an inconsistency in the significance values of several variables. Both black student-black teacher and students' socioeconomic status were found to have higher levels of significance in the MI analysis than in the LD analysis. Several other variables were found to have higher significance levels in the LD analysis than in the MI analysis: teacher's educational level, teacher's age (which was not found to be significant at all in the MI analysis), and public school. In Model 3 for Approaches to Learning, there are three statistically significant variables which were not significant in the MI analysis: black student-black teacher, teachers' age, and percentage of black students in the school. One can see that the teacher and school characteristics show an increase in significance levels in the LD results as compared to the MI analysis, a trend which was also evident in the DML analysis.

In Model 4, Downey and Pribesh replace the student race and teacher race variables with three measures of student-teacher racial matching, with the white student-white teacher combination as the omitted referent category. This change in variables had very little effect on the results and did not result in much change in regard to the remaining independent variables. Thus, the discrepancies found between the MI and LD methods in Model 4 are similar to those from Model 3. For the problem behaviors analyses, the students' socioeconomic status remains less significant in the LD analysis than in MI. Additionally, teachers' educational level, public school, and percentage of students eligible for free lunch in the school have increased statistical significance in the LD method as compared to the MI method. In the analysis using Approaches to Learning as the dependent variable, the discrepancies between the LD and MI methods continue. Both teachers' age and the percentage of black students in the school are statistically significant in LD, whereas they were not in MI. In both Models 3 and 4, variables were found to be statistically significant using the LD method but not significant using the MI method, which could result in very different substantive conclusions regarding the effects of teacher and school characteristics on teachers' evaluations of students' behaviors. Thus, the current methodological study supported the theoretical literature's opinion that LD will result in biased and inefficient estimates due to a dramatically decreased sample size. However, this study's findings also departed from the literature in that it found the same patterns of increased significance levels in the DML method as it did in LD.

Summary of Findings

Given this comparison of three different methods for handling incomplete data (LD, MI, and DML), the hypothesis that DML and MI will produce equivalent results cannot be fully supported. While DML and MI did produce comparable results regarding the primary independent variables' effects on the dependent variables, they produced substantially different findings regarding which control variables may be significant indicators, with both DML and LD having inflated significance levels for several independent variables. While this did not affect Downey and Pribesh's ultimate conclusion, it might have if the study's authors had not been as focused on the race variables. If the independent variable of interest had been a teacher or school characteristic, it might have been found not significant using MI but statistically significant using DML or even LD.
Consequently, while DML and MI did produce equivalent results from the perspective of Downey and Pribesh, the methods might have resulted in entirely different findings if one of the control variables had been the focus of the study instead of race. Nonetheless, the hypothesis that in application DML and MI will arrive at the same substantive conclusions can be supported. Downey and Pribesh’s substantive conclusion, that black students receive less favorable evaluations of their classroom behavior as a function of teacher bias, would be supported by performing the statistical analysis with any of these three methods. In all three methods, student race remained a highly significant indicator of poor evaluations on both dependent variables for all models. Additionally, the black student-white teacher racial matching variable was found to be statistically significant, with black students paired with white teachers receiving the worst evaluations when compared to the other black/white student-teacher racial combinations across all models and with all methods. Therefore, even though there are notable differences in regard to which additional independent variables are significant indicators of the dependent variables, the evidence which Downey and Pribesh used to arrive at their substantive conclusion is present using any of these three methods. Thus, one can conclude that DML and MI in application may arrive at the same substantive conclusions.

Chapter 5

DISCUSSION

Discussion of Findings

Although the literature suggested that DML and MI would produce largely equivalent results, the current study found that DML shared more similarities with LD than with MI. In general, both DML and LD produced increased significance levels and therefore a greater number of statistically significant variables when compared to the MI results. Despite these differences, the regression analyses using DML and LD both still resulted in findings which supported the substantive conclusions Downey and Pribesh (2004) reached using the MI method; thus, the hypothesis that in practice MI and DML will produce equivalent results was generally supported. The hypothesis also suggested that because MI and DML produce equivalent results, DML should be used whenever possible as it is generally easier to implement. Unfortunately, the current study was unable to adequately test this hypothesis, as this author did not execute the analysis using MI but instead relied on Downey and Pribesh’s published MI results. In application, this author did find that DML and LD were fairly equivalent and simple in their implementation, the only drawback to implementing DML being the need to acclimate to a graphical user interface for specifying the model in a path diagram, as opposed to traditional menu-driven programs. Therefore, one can conclude that, given the stronger theoretical support for DML and its fairly simple implementation, DML should be favored over LD whenever possible. Further, as the literature suggests that MI procedures require a large amount of computer memory and processing time due to the creation of numerous imputed datasets, one may be fairly confident in recommending DML over MI where appropriate, as it requires no additional processing time or storage space. Based on the prior literature, one can conclude that the MI estimates are the most valid and efficient and the least biased of the three used in this methodological comparison.
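To make this computational trade-off concrete, the following is a minimal Python sketch of the MI workflow: several completed datasets are generated, the regression is estimated on each, and the estimates are pooled using Rubin’s (1987) rules. The statsmodels MICE implementation and the placeholder variable names are assumptions made for illustration; Downey and Pribesh’s actual imputation software and model are not reproduced here.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.imputation import mice

    # Placeholder analysis file and variable names.
    df = pd.read_csv("ecls_k_analysis_file.csv")
    imp = mice.MICEData(df[["externalizing", "black_student", "student_ses"]])

    m = 20                                # number of imputed datasets
    betas, variances = [], []
    for _ in range(m):
        data = imp.next_sample()          # draw the next completed dataset
        X = sm.add_constant(data[["black_student", "student_ses"]])
        fit = sm.OLS(data["externalizing"], X).fit()
        betas.append(fit.params.to_numpy())
        variances.append(fit.bse.to_numpy() ** 2)

    betas, variances = np.array(betas), np.array(variances)
    q_bar = betas.mean(axis=0)                    # pooled point estimates
    u_bar = variances.mean(axis=0)                # within-imputation variance
    b = betas.var(axis=0, ddof=1)                 # between-imputation variance
    pooled_se = np.sqrt(u_bar + (1 + 1 / m) * b)  # Rubin's rules total variance
    print(pd.DataFrame({"coef": q_bar, "se": pooled_se},
                       index=["const", "black_student", "student_ses"]))

The loop makes plain why MI’s storage and processing costs grow with the number of imputations, whereas DML maximizes a single likelihood over the observed data and requires no additional passes.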
Accordingly, given that the MI estimates are likely the least biased of the three, one would also conclude that the inflated results obtained via the DML and LD methods may have been affected by bias due to missing data, at least to some extent. For these reasons, the prior literature advises that one should evaluate the missingness of a data set before beginning a statistical analysis. Unfortunately, the current study was unable to adequately review the missingness of the ECLS-K dataset, as the public use data files were edited by the NCES prior to release, concealing a great deal of information from the data analyst.

Evaluation and Critique of Study

The current study is a re-examination of an existing study by Downey and Pribesh (2004) on the effects of students’ and teachers’ race on teachers’ evaluations of students’ classroom behaviors using data from the ECLS-K dataset. Downey and Pribesh performed a secondary data analysis to test whether black students received less favorable evaluations than white students due to teachers’ racial bias or black students’ adoption of Oppositional Culture. Based on their statistical findings using the MI method to handle missing data in the ECLS-K dataset, Downey and Pribesh concluded that black students received more negative evaluations than white students as a result of teachers’ racial bias. As Downey and Pribesh state, these findings show that race continues to be an important factor in the classroom and that the idea of student-teacher racial matching deserves further attention. They suggest that the next step is to study how racial matching affects students’ academic achievement. As with all studies, there are limitations to Downey and Pribesh’s study. Their paper provides a brief discussion of the limitations they perceived, such as the fact that their findings were based on a limited amount of data gathered at only one specific point in time, included only teachers’ evaluations of students’ behaviors, and did not include any students’ evaluations of teachers. There are other apparent weaknesses in Downey and Pribesh’s study. As with all statistical analyses, there are limitations based on the particular dataset used. The limitations of this study inherent to the secondary analysis of the ECLS-K dataset will be discussed in detail in a separate section below. Downey and Pribesh’s study can also be critiqued based on a close examination of the study’s methodology and theoretical reasoning. When reviewing published studies based on statistical analyses, one must be sure to evaluate the theoretical and methodological basis for the substantive conclusions made. On a basic level, one must ask whether the variables are actually measuring what the authors claim they are measuring and whether the model is actually testing the hypothesis that the authors have developed. As Downey and Pribesh point out, they use only measures of teachers’ evaluations of students’ classroom behaviors from the ECLS-K as the basis for their substantive conclusion that teachers are biased against black students. This is problematic, as they do not use the most common and official means by which teachers evaluate students: grades. Teachers receive training to evaluate students on an academic grading system and may not have understood the evaluation scheme of the ECLS-K survey, or may have provided evaluations without much thought.
Therefore, while it is important to understand teachers’ opinions and attitudes regarding students’ behaviors, the Externalizing Problem Behaviors and Approaches to Learning scales may not have been the most reliable or valid measures to use. Additionally, whereas the ECLS-K gathered fairly detailed ethnic and racial data from a nationwide sample of students and teachers, Downey and Pribesh used data only from black and white students and teachers in their analysis. These findings were generalized in the substantive conclusion that black students receive less favorable evaluations based on teachers’ racial bias, without ever considering students and teachers of other racial and ethnic groups. The problem of choosing variables to measure race quantitatively raises the question of whether it is even appropriate to study race in this manner. One must consider whether the concept of race can be reduced to a single variable and whether quantitative studies can truly be used to examine problems involving race. After careful examination of the main variables used in Downey and Pribesh’s study, one can see that the variables were perhaps not measuring what the authors claimed to be measuring. Downey and Pribesh compare the results of two independent statistical analyses to test their hypothesis that black students receive less favorable behavioral evaluations in school as a function of teacher bias as opposed to Oppositional Culture Theory (OCT). While they provide quantitative support for their conclusion, it is questionable why they did not simply test their teacher bias hypothesis and the OCT hypothesis separately instead of pitting the two theories against one another. Downey and Pribesh pose their research question as if these two are the only possible causes of poor evaluations of black students’ behavior and as if they were mutually exclusive options. The authors also seem closed to the idea that any variable besides race could have a significant impact on teachers’ evaluations, as there was little discussion of other significant variables in the analysis, such as family, teacher, and school characteristics. Further, as discussed in the literature review, both the teacher bias and OCT hypotheses are based on theories which many sociologists consider problematic and myopic. In short, both simply place blame on either the teacher or the student and fail to account for the larger processes at work behind racial bias and oppositional culture. Downey and Pribesh further narrow the matter by restricting the focus to black-white racial matching alone. Thus, an evaluation of the theoretical basis of Downey and Pribesh’s study reveals reasoning which appears problematic and possibly biased. Just as Downey and Pribesh’s research methods and substantive conclusions were thoroughly examined and critiqued, one must scrutinize the current study to determine its strengths and weaknesses and reveal any potential problems. After a review of the prior literature, it is apparent that there is a lack of discussion regarding missing data methods in sociology; one of the primary strengths of the current study is that it works to fill this gap. In addition to supplying needed information regarding the importance of considering missingness prior to data analysis, the current study can be used to revitalize the discussion regarding the proper use and treatment of incomplete datasets and the evaluation of substantive conclusions based on incomplete datasets.
This methodological comparison as a means of evaluating substantive conclusions is an innovative approach to sociological research and may provide the impetus for future studies, an obvious strength of the current study. As with any study, the current analysis is not without its limitations. These include the fact that the methodological comparison was limited to the analysis of only one dataset using only three methods, with only two of the analyses executed by this author. Consequently, one may conclude that the discrepancies between the MI and DML methods, and the similarities between the DML and LD methods, may be the result of differences in data handling by this author and Downey and Pribesh. Although the selection of methods was based on a review of the prior literature on the subject, one must still question whether the results of this study are generalizable to other data analyses. Although the findings of this study did support the original hypothesis that MI and DML will lead to the same substantive conclusions in application, it was not anticipated that the DML and LD methods would produce results more similar to one another than to the MI results. These unexpected findings cast further doubt on the generalizability of the findings, a major limitation of the study. One possible way to address this area of weakness would be to perform further methodological comparisons and create a meta-analysis comparing the findings from the various studies. The most obvious and consequential limitations of this study are related to the use of the ECLS-K dataset, limitations which are shared by the Downey and Pribesh study as they are inherent to the dataset. Unfortunately, only the public use versions of the ECLS-K data were available for use, and these data sets were already edited by the NCES prior to being released to the public. Therefore, this study was unable to adequately evaluate the types and amounts of missing data in the ECLS-K dataset. In addition to editing data to protect the students’ privacy, the NCES also replaced numerous variables with NCES-created composite variables. Consequently, all information on missing cases for the original variables was made unavailable to the public. Further, the NCES did admit to very low school response rates, which led to the recruitment of replacement schools, a departure from the original sampling frame. Even with these additional recruitment measures, the school completion rates were below NCES standards and prompted an internal nonresponse bias analysis. Although the NCES nonresponse bias analysis concluded that there was no bias due to school nonresponse, the current study was unable to independently test this claim due to the unavailability of data (Bose 2001; West, Denton, and Reaney 2001). As with all analyses of secondary data, one must trust that the sampling, data collection, and data coding functions were performed properly. However, one must still evaluate whether there are any potential risks to the reliability or validity of the data. In the ECLS-K, as with many large confidential datasets, several codes are used for the various types of missing data. Still, not all situations fit neatly into these coding schemes, and the secondary data analyst must use the data as available without having access to the actual responses.
For example, according to the ECLS-K codebook (NCES 2004), children needing special accommodations due to physical or cognitive limitations were excluded from certain survey components and coded as “not applicable” on those variables. If a child was unable to complete a question even with repeated instruction, the code “don’t know” was used. As a secondary data user, one must trust that the survey administrators were capable of determining whether the child needed special accommodations or simply did not know the answer. Unfortunately, many areas of limitation are inherent to the use of secondary data in a statistical analysis and are unavoidable without the collection of one’s own data. Another limitation of this study is that the methodological comparison used one analysis prepared by another author and two analyses prepared by the current author. As the MI results were obtained by Downey and Pribesh and the DML and LD results were obtained by this author, the differences between the MI and DML results may stem from minor differences in data preparation. One area of difference may be that the current author recoded variables so that only meaningful values were included in the analysis and all other values (such as refusals) were denoted as “missing,” as sketched below. Therefore, one may improve upon the current study by conducting a methodological comparison using only analyses with data prepared in exactly the same manner, or by the same analyst.
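As an illustration of that recoding step, the following minimal Python sketch converts reserve codes to missing values and then summarizes the resulting missingness. The specific codes and variable names are placeholders; the actual reserve codes must be taken from the ECLS-K codebook rather than from this example.

    import numpy as np
    import pandas as pd

    # Placeholder reserve codes in the spirit of large survey codebooks
    # (e.g., not applicable, refused, don't know, not ascertained); the
    # real codes come from the ECLS-K documentation.
    MISSING_CODES = [-1, -7, -8, -9]

    df = pd.read_csv("ecls_k_analysis_file.csv")
    df = df.replace(MISSING_CODES, np.nan)   # keep only meaningful values

    # A basic missingness review: proportion missing per variable.
    print(df.isna().mean().sort_values(ascending=False))

Decisions made at this stage, such as whether “don’t know” is treated as a substantive response or as missing, change the missingness pattern of the analysis file and therefore the downstream estimates, which is precisely how differences in data preparation could produce the method discrepancies described above.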
In summary, it is important to understand that every study will have certain strengths and weaknesses. It is even more important to evaluate a study and determine what these assets and limitations are prior to utilizing a secondary source or publishing one’s own findings for others to use. Due to the nature of the ECLS-K public use dataset, many of the limitations of the current study, as well as of Downey and Pribesh’s study, were unavoidable if the ECLS-K data were to be used. Certain areas of weakness are due to the adherence to a particular theory, such as Oppositional Culture Theory or teacher bias theory. While not all of a study’s limitations may be avoided, all should be revealed and discussed so that the substantive conclusions drawn from that study can be applied appropriately.

Impact on Future Research

As indicated above, one of the primary strengths of this study is that it provides information on a neglected area in sociology: missing data methods. Missingness is considered inevitable in social research. Therefore, issues regarding the analysis of missing data need to be discussed, understood, and considered by the field of sociology. The understanding of such issues is imperative, as traditional statistical methods will not perform properly on incomplete data. One must either edit and adapt the dataset to fit a traditional method or use a method designed for incomplete data. Unfortunately, these methods are neither well known nor commonly used. If an unsuitable method is used, it can produce unreliable and biased results and errors in performing even basic functions. This can lead to increased errors in hypothesis testing and thus distorted substantive conclusions. However, many sociologists are concerned with substantive issues and do not want to take the time or effort to consider technical and methodological issues. In addition, many do not have the expertise to recognize the issues surrounding missing data methods. Thus, many simply use the method and program they are comfortable with and unknowingly base substantive conclusions on biased results. The current study not only fills a void in the existing literature regarding missing data methods, but also draws attention to the neglected area of handling incomplete data in sociological methods and opens up the discussion of these issues. Underlying issues were revealed regarding the importance of method choice and research design in sociological studies using incomplete data, areas which have been largely ignored in the field of sociology. First is the issue of reviewing and understanding the dataset prior to conducting any analysis. Ideally, one would have a thorough knowledge of the dataset prior to even deciding what method of statistical analysis is to be performed and with what software package. This includes an understanding of the limitations of the particular dataset. Second is the issue of ensuring that the hypothesis one seeks to test is actually being tested by the model chosen. The implications of this study for the field of social research include an increased emphasis on teaching the importance of method choice, and on teaching more than just the traditional missing data methods in research methods courses. Further, it should lead to increased scrutiny of the methods used to treat incomplete data in published works. Although there are limitations to using the ECLS-K data in statistical analyses, there are numerous possibilities for future substantive research using the ECLS-K as a foundation. As Downey and Pribesh (2004) suggest, one can use the data collected in the ECLS-K study to evaluate whether teachers’ or students’ race has an effect on the academic achievement of students or on the academic evaluations they receive. Due to the wealth of information contained in the ECLS-K, future analyses of these data can evaluate “the role of various things such as child care, home educational environment, teachers’ instructional practices, class size and the general climate, and facilities and safety of the schools” on areas such as changes in student academic achievement and performance (West, Denton, and Reaney 2001: xiii). The current literature suggests a focus on using research in standards-based educational reform, and surveys such as the ECLS-K will provide the data by which school programs can be evaluated (Stanovich and Stanovich 2003). While there are many opportunities for future research using the ECLS-K data, the fact remains that the limitations inherent to this dataset must be thoroughly understood and taken into account to ensure that only valid substantive conclusions are based on these data. The statistical findings and substantive conclusions made by Downey and Pribesh in their 2004 study of the effects of students’ and teachers’ race on teachers’ evaluations of students’ classroom behaviors also provide many possibilities for future research in the areas of race, education, and child behavior. Though there were several apparent weaknesses in their study, Downey and Pribesh’s findings may serve as a viable jumping-off point for future sociologists interested in studying these areas. As Downey and Pribesh point out in their article, their focus on classroom behaviors leaves room for research into the areas of academic achievement and grading. Given that the concepts of race and education are so broad and pervasive, Downey and Pribesh’s statistical results and substantive findings will no doubt lend support and influence to future sociological research.
Conclusion

While the current study’s methodological comparisons resulted in some unexpected findings, the results generally supported the hypothesis that MI and DML in application will result in the same substantive conclusions. All three methods produced findings which supported the substantive conclusion made by Downey and Pribesh that black students receive less favorable evaluations than white students due to racially based teacher bias. After a careful review of the prior literature and Downey and Pribesh’s 2004 study, with a methodological comparison as the basis of evaluation, one finds that the choice of method used for analysis and for handling missing data is an important step in sociological research. It is evident that the choice of method can have an effect on statistical findings, and based on this study one can be fairly confident in using DML as the missing data method of choice whenever appropriate, as it is generally easier to implement than MI and has more theoretical support in the literature than LD. While methodological considerations are an important foundation for a valid and unbiased study, it is also clear that the hypothesis guides the focus of the research and thus may draw attention to, or away from, the influence of particular independent variables. The influence of the hypothesis on the focus of research can be seen in Downey and Pribesh’s lack of discussion regarding statistically significant independent variables other than race in their study. Theoretical and substantive issues will continue to guide sociological research; however, it is evident that sociologists must begin to seriously consider methodological issues in order to ensure the reliability of future substantive conclusions based on statistical analyses using incomplete quantitative data.

REFERENCES

Ainsworth-Darnell, James W. and Douglas B. Downey. 1998. “Assessing the Oppositional Culture Explanation for Racial/Ethnic Differences in School Performance.” American Sociological Review 63(4):536-553.
Alexander, Karl L., Doris R. Entwisle, and Maxine S. Thompson. 1987. “School Performance, Status Relations, and the Structure of Sentiment: Bringing the Teacher Back In.” American Sociological Review 52(5):665-682.
Allison, Paul D. 1987. “Estimation of Linear Models with Incomplete Data.” Sociological Methodology 17:71-103.
------. 1999. Multiple Regression: A Primer. Thousand Oaks, CA: Pine Forge Press.
------. 2000. “Multiple Imputation for Missing Data: A Cautionary Tale.” Sociological Methods & Research 28(3):301-309.
------. 2002. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks, CA: Sage.
Arbuckle, James L. 2007. Amos 17.0 User’s Guide. [MRDF] Spring House, PA: Amos Development Corporation. (http://amosdevelopment.com).
Baydar, Nazli. 2004. “Book Reviews.” Sociological Methods & Research 33:157-161.
Bodovski, Katrina and George Farkas. 2008. “Concerted Cultivation and Unequal Achievement in Elementary School.” Social Science Research 37:903-919.
Bose, Jonaki. 2001. “Nonresponse Bias Analyses at the National Center for Education Statistics.” Proceedings of Statistics Canada Symposium 2001. Achieving Data Quality in a Statistical Agency: A Methodological Perspective.
Byrne, Barbara M. 2001a. Structural Equation Modeling with AMOS: Basic Concepts, Applications, and Programming. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
------. 2001b. “Structural Equation Modeling with AMOS, EQS, and LISREL: Comparative Approaches to Testing for the Factorial Validity of a Measurement Instrument.” International Journal of Testing 1(1):55-86.
Carter, Rufus Lynn. 2006. “Solutions for Missing Data in Structural Equation Modeling.” Research & Practice in Assessment 1(1):1-6.
Cohen, Philip N. and Matt Huffman. 2007. “Working for the Woman? Female Managers and the Gender Wage Gap.” American Sociological Review 72:681-704.
Collins, Linda M., Joseph L. Schafer, and Chi-Ming Kam. 2001. “A Comparison of Inclusive and Restrictive Strategies in Modern Missing Data Procedures.” Psychological Methods 6(4):330-351.
Condron, Dennis J. 2007. “Stratification and Educational Sorting: Explaining Ascriptive Inequalities in Early Childhood Reading Group Placement.” Social Problems 54(1):139-160.
Cunningham, Everarda G. and Wei C. Wang. 2005. “Using AMOS Graphics to Enhance the Understanding and Communication of Multiple Regression.” IASE/ISI Satellite. Swinburne University of Technology, Australia.
Downey, Douglas B. 2008. “Black/White Differences in School Performance: The Oppositional Culture Explanation.” Annual Review of Sociology 34:107-126.
Downey, Douglas B. and Shana Pribesh. 2004. “When Race Matters: Teachers’ Evaluations of Students’ Classroom Behavior.” Sociology of Education 77(4):267-282.
Downey, Douglas B., Paul T. von Hippel, and Beckett A. Broh. 2004. “Are Schools the Great Equalizer? Cognitive Inequality During the Summer Months and the School Year.” American Sociological Review 69(5):613-635.
Ehrenberg, Ronald G., Daniel D. Goldhaber, and Dominic J. Brewer. 1995. “Do Teachers’ Race, Gender, and Ethnicity Matter? Evidence from the National Educational Longitudinal Study of 1988.” Industrial and Labor Relations Review 48(3):547-561.
Eliason, Scott R. 1993. Maximum Likelihood Estimation: Logic and Practice. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-096. Newbury Park, CA: Sage.
Enders, Craig K. 2001. “A Primer on Maximum Likelihood Algorithms Available for Use With Missing Data.” Structural Equation Modeling 8(1):128-141.
------. 2006. “A Primer on the Use of Modern Missing-Data Methods in Psychosomatic Medicine Research.” Psychosomatic Medicine 68:427-436.
Entwisle, Doris R., Karl L. Alexander, and Linda Steffel Olson. 2005. “First Grade and Educational Attainment by Age 22: A New Story.” American Journal of Sociology 110(5):1458-1502.
Espeland, Mark A. 1988. “Review.” American Journal of Sociology 94(1):156-158.
Espinosa, Linda M. and James M. Laffey. 2003. “Urban Primary Teacher Perceptions of Children with Challenging Behaviors.” Journal of Children and Poverty 9(2):135-156.
Farkas, George, Christy Lleras, and Steve Maczuga. 2002. “Does Oppositional Culture Exist in Minority and Poverty Peer Groups?” American Sociological Review 67(1):148-155.
Fay, Robert E. 1992. “When Are Inferences from Multiple Imputation Valid?” Proceedings of the Survey Research Methods Section, American Statistical Association. Pp. 227-232.
Fowler, Floyd J. Jr. 2002. Survey Research Methods. 3rd ed. Applied Social Research Methods Series, Vol. 1. Thousand Oaks, CA: Sage Publications.
Freedman, Vicki A. and Douglas A. Wolf. 1995. “A Case Study on the Use of Multiple Imputation.” Demography 32(3):459-470.
Graham, John W., Scott M. Hofer, and Andrea M. Piccinin. 1994. “Analysis with Missing Data in Drug Prevention Research.” Pp. 13-63 in National Institute on Drug Abuse Research Monograph Series: Advances in Data Analysis for Prevention Intervention Research, edited by L. M. Collins and L. A. Seitz. Rockville, MD: National Institute on Drug Abuse.
Gosa, Travis L. and Karl L. Alexander. 2007. “Family (Dis)Advantage and the Educational Prospects of Better Off African American Youth: How Race Still Matters.” Teachers College Record 109(9):285-321.
Horton, Nicholas J. and Stuart R. Lipsitz. 2001. “Multiple Imputation in Practice: Comparison of Software Packages for Regression Models with Missing Variables.” The American Statistician 55(3):244-254.
Little, Roderick and Donald B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc.
Lleras, Christy. 2008. “Do Skills and Behaviors in High School Matter? The Contribution of Noncognitive Factors in Explaining Differences in Educational Attainment and Earnings.” Social Science Research 37:888-902.
Long, Barbara H. and Edmund H. Henderson. 1971. “Teachers’ Judgments of Black and White School Beginners.” Sociology of Education 44(3):358-368.
Mackelprang, A. J. 1970. “Missing Data in Factor Analysis and Multiple Regression.” Midwest Journal of Political Science 14(3):493-505.
Madow, William G., Harold Nisselson, and Ingram Olkin, eds. 1983. Incomplete Data in Sample Surveys, Vol. 1, Report and Case Studies. New York: Academic Press.
McArdle, John J. 1994. “Structural Factor Analysis Experiments with Incomplete Data.” Multivariate Behavioral Research 29(4):409-454.
National Center for Education Statistics, U.S. Department of Education. 2004. ECLS-K Base Year Public-Use Data Files and Electronic Codebook. CD-ROM. NCES 2001-029, Revised August 2004. Rockville, MD: Westat.
Navarro, Jose Blas. 2003. “Methods for the Analysis of Explanatory Linear Regression Models with Missing Data Not at Random.” Quality & Quantity 37:363-376.
Penn, David A. 2007. “Estimating Missing Values from the General Social Survey: An Application of Multiple Imputation.” Social Science Quarterly 88(2):573-584.
Regoeczi, Wendy C. and Marc Riedel. 2003. “The Application of Missing Data Estimation Models to the Problem of Unknown Victim/Offender Relationships in Homicide Cases.” Journal of Quantitative Criminology 19(2):155-183.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63(3):581-592.
------. 1978. “Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse.” Proceedings of the Survey Research Methods Section, American Statistical Association. Pp. 20-34. Washington, D.C.
------. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc.
------. 1996. “Multiple Imputation After 18+ Years.” Journal of the American Statistical Association 91(434):473-489.
Rudas, Tamas. 2005. “Mixture Models of Missing Data.” Quality & Quantity 39:19-36.
Schafer, Joseph L. 1999. “Multiple Imputation: A Primer.” Statistical Methods in Medical Research 8:3-15.
Schafer, J. L., T. M. Ezzati-Rice, W. Johnson, M. Khare, R. J. A. Little, and D. B. Rubin. 1996. “The NHANES III Multiple Imputation Project.” Proceedings of the Survey Research Methods Section, American Statistical Association. Pp. 28-37.
Schafer, Joseph L. and John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7(2):147-177.
Schafer, Joseph L. and Maren K. Olsen. 1998. “Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst’s Perspective.” Multivariate Behavioral Research 33(4):545-571.
Shernoff, David J. and Jennifer A. Schmidt. 2008. “Further Evidence of an Engagement-Achievement Paradox Among U.S. High School Students.” Journal of Youth and Adolescence 37:564-580.
Sinharay, Sandip, Hal S. Stern, and Daniel Russell. 2001. “The Use of Multiple Imputation for the Analysis of Missing Data.” Psychological Methods 6(4):317-329.
Stanovich, Paula J. and Keith E. Stanovich. 2003. “Using Research and Reason in Education: How Teachers Can Use Scientifically Based Research to Make Curricular & Instructional Decisions.” Portsmouth, NH: RMC Research Corporation.
Stearns, Elizabeth and Elizabeth J. Glennie. 2006. “When and Why Dropouts Leave High School.” Youth & Society 38(1):29-57.
Stumpf, Stephen A. 1978. “A Note on Handling Missing Data.” Journal of Management 4(1):65-73.
Tach, Laura Marie and George Farkas. 2006. “Learning-Related Behaviors, Cognitive Skills, and Ability Grouping when Schooling Begins.” Social Science Research 35:1048-1079.
Takei, Yoshimitsu and Roger Shouse. 2008. “Ratings in Black and White: Does Racial Symmetry or Asymmetry Influence Teacher Assessment of a Pupil’s Work Habits?” Social Psychology of Education 11(4):367-387.
West, Jerry, Kristin Denton, and Lizabeth M. Reaney. 2001. The Kindergarten Year: Findings from the Early Childhood Longitudinal Study, Kindergarten Class of 1998-99. National Center for Education Statistics, NCES 2001-023. Washington, DC: U.S. Department of Education.
Wothke, Werner and James L. Arbuckle. 1996. “Full-Information Missing Data Analysis with AMOS.” SPSS White Paper.
Yuan, Yang C. 2000. “Multiple Imputation for Missing Data: Concepts and New Development.” P267-25. SAS White Papers. Rockville, MD: SAS Institute.
Yuan, Ke-Hai and Peter M. Bentler. 2000. “Three Likelihood-Based Methods for Mean and Covariance Structure Analysis with Nonnormal Missing Data.” Sociological Methodology 30:165-200.
Zeiser, Krissy. 2008. “PRI Workshop: Introduction to AMOS.” Pennsylvania State University, November 13, 2008.