Exploring configurational causation in large datasets with QCA: possibilities and problems

Barry Cooper & Judith Glaesser
School of Education, Durham University

3rd ESRC Research Methods Festival
St Catherine’s College Oxford, 30 June – 3 July 2008

A note re these slides.
• Some of these slides will be used in our presentation itself but some have been written to provide, as a context for the tables, etc., a pre- and post-festival web-based sketch of the method we have employed (Ragin’s Qualitative Comparative Analysis, or QCA) for any readers new to it.
• After a brief description of the background to Ragin’s development of the set theoretic approach, and a list of what we see as its strengths, we will illustrate its use with large n data, drawing on our experience of using QCA (Cooper, 2005, 2006; Cooper & Glaesser, 2007, 2008, in press; Glaesser, forthcoming).
• To keep things less complex than they would otherwise become, we will not draw attention, during this part of our presentation, to the more problematic issues that we wish to mention.
• Instead, we deal with this aspect of our presentation after the illustration of the use of QCA in a large n context.

Concerns about the dominant regression approach in quantitative analysis have a long history. Here, for example, are various remarks taken from Peter Abell’s 1971 book, Model Building in Sociology:
• It is often (perhaps more often than not) the case that the covariation between sociological variables is not linear (p.174).
• It was argued ... that interaction is a characteristic feature of sociological covariation (p.183).
• Multicollinearity is pervasive in sociology; it is more often than not the case that explanatory variables are intercorrelated (p.189).
• But from what was said earlier it might be expected that … (cardinal) variables will be of relatively rare occurrence in sociology. One is much more likely to encounter the situation where nominal and ordinal variables are related (p.197).
• We have noted earlier that the typical causal situation in social science is one of over-determination – many different clusters of variables are sufficient for a given effect (p.236).

Abell’s book also includes considerable discussion of the logic of necessary and sufficient conditions alongside his discussion of linear modelling.

Several authors, from various perspectives, have raised important concerns about regression and its uses. For example (see attached bibliography for details):
• Boudon (1974a,b)
• Byrne (1998, 2002)
• Freedman (1987, 1997)
• Hedström (2005)
• Lieberson (1985)
• Morgan and Winship (2007)
• Ormerod (1998)
• Pawson & Tilley (1997)
• Pearl (2000)
• Ron (2002)
• Sörensen (1998)
• Taagepera (2005).

Andrew Abbott (2001) has summarised some of the key assumptions of the linear model normally used in regression:
• The social world is made up of fixed entities with varying attributes (demographic assumption).
  – Some attributes determine (cause) others (attribute causality assumption).
• What happens to one case doesn't constrain what happens to others, temporally or spatially (casewise independence assumption).
• Attributes have one and only one causal meaning within a given study (univocal meaning assumption).
• Attributes determine each other principally as independent scales rather than as constellations of attributes; main effects are more important than interactions (which are complex types) (main effects assumption).

Charles Ragin’s work

Ragin (1987) shared many of the concerns of these various writers, but, in particular perhaps, focussed on Abbott’s third and fourth points, the relative neglect of causal heterogeneity and complex interaction in regression models when used in practice[1]. Using set theory rather than regression’s linear algebra as the basis for developing a configurational approach to causal modelling, he began to explore ways in which (i) complex interaction between causal factors and (ii) causal heterogeneity (i.e.
the existence of several distinct types of cases in a ‘population’[2] and therefore of possible multiple pathways to an outcome) could be described in Boolean or configurational terms (Ragin, 1987, 2000, 2006a). In doing so, he also aimed to shift researchers’ practices away from a focus on the net average effects of variables (i.e. on which variables win the race to explain most variance) and towards an approach that recognised that events in the world are often caused by conjunctions of factors (Ragin, 2006b). It is his Qualitative Comparative Analysis (QCA) on which we focus in this paper.

[1] On Abbott’s second point, see Hedström (2005).
[2] The returns to cognitive capacity, for example, might differ systematically between social classes.

Before introducing QCA in more detail, we might set out what we regard as the strengths of Ragin’s approach:
1. A focus on cases and their constituent features rather than, as in regression, on abstracted variables (and therefore net – and often average – effects).
2. Analysis of multiple and conjunctural causation in terms of necessary and/or sufficient conditions rather than in terms of the linear additive model.
3. The recognition, up front, of the possibility of causal heterogeneity.
4. The offer of a rigorous approach, drawing on set theory and logic, to the analysis of these features of social reality.
5. Through a focus on INUS[1] conditions, the allowing, up front, of complex interactions between causes.
6. The recognition of the problems resulting from limited diversity in social datasets.

[1] An INUS condition is “an insufficient but non-redundant part of an unnecessary but sufficient condition” (Mackie, 1974).

Boolean functional form: an example
• Ragin’s QCA and its associated software use Boolean algebra to address conjunctural causation. Boolean equations have a different functional form to the regression equations with which social scientists are familiar.
Here is an example taken from a paper contrasting the approaches (Mahoney & Goertz, 2006):
• Y = (A*B*c) + (A*C*D*E)
• In these equations the symbol * indicates Logical AND (set intersection), + indicates Logical OR (set union), upper case letters indicate the presence of factors, lower case indicate their absence. In this fictional example of causal heterogeneity, the equation indicates that there are two causal paths to the outcome Y. The first, captured by the causal configuration A*B*c, involves the presence in the case of features A and B, combined with the absence of C. The second, captured by A*C*D*E, requires the joint presence of A, C, D and E. Either of these causal configurations is sufficient for the outcome to occur, but neither is necessary, considered alone. A is necessary but not sufficient. The factor C behaves differently in the two configurations. This non-probabilistic – or veristic – example, of course, assumes no empirical exceptions to these relations.

QCA: Sufficiency and quasi-sufficiency

Sufficiency, understood causally or logically, involves a subset relation. If, for example, a single condition is always sufficient for an outcome to occur, the set of cases with the condition will be a subset of the set of cases with the outcome. This is shown in Figure 1 (next slide) based on a hypothetical relation between being of service class origin and achieving a degree. Given the condition, we obtain the outcome. In applications to real large n data, perfect sufficiency is unlikely to be found, and a situation like Figure 2 (next slide) will often be found, where most but not all of the set of cases with the condition also are members of the outcome set.
Using conventional crisp sets, the proportion of the members of the condition set who are also members of the outcome set can be used as a measure of the degree of consistency of the empirical relation with a relation of perfect sufficiency (here: the number in the yellow subset divided by the number in the yellow and green subsets taken together). Figure 2 illustrates a relation that might be described as only ‘nearly always sufficient’. Alternatively, using a probabilistic view of causation, being of service class origin here could be said to be a sufficient condition, all else being equal, for raising the probability of achieving the outcome to a level equal to this “consistency” proportion. Figure 1: Perfect Sufficiency Figure 2: Quasi-Sufficiency QCA: Necessity & Coverage In Figure 3 (next slide), another hypothetical relation between being of service class origin and achieving a degree is shown. This is another example of less than perfect sufficiency. Here the members of the yellow fringe of the service class origin set are not also members of the outcome set. However, most members of this condition set are. This example is also, in fact, a special case in that being of service class origin is a necessary condition for achieving a degree (and in the case of necessity the outcome set is, as can be seen, a subset of the condition set, reversing the direction of the subsethood relation that characterises sufficiency). Venn diagrams can also illustrate Ragin’s concept of explanatory coverage (Ragin, 2006a). The proportion of the outcome set that is overlapped by the condition set can be used as a measure of the degree to which the outcome is covered (‘explained’) by the condition. In Figure 1 (previous slide), the coverage of the outcome of having a degree by the condition of being of service class origin can be seen to be low, with only around 40% of the (blue) outcome set covered by the (yellow) condition set. 
In Figure 3 (next slide), on the other hand, it can be seen that the whole of the outcome set (again in blue) is covered by the (yellow) condition set, and coverage is 100% (the arithmetic mark of a necessary condition in this simple case). Figure 3: Quasi-Sufficiency (but with perfect necessity) QCA: Multiple conditions and the partitioning of coverage: I In more complex set theoretic models with more than one condition, coverage can be partitioned in a manner analogous to the partitioning of variance explained in regression-based approaches (Ragin, 2006a). The partitioning of coverage into raw and unique components can be illustrated, again using imaginary data, by reference to a more complex Venn diagram (Figure 4, next slide). Here we have added the condition of being of high ability. In this fictional case we now have two crisp sets representing the conditions, ‘SERVICE CLASS ORIGIN’ and ‘HIGH ABILITY’, and the outcome is the achievement of a degree. The Boolean solution can be written as DEGREE = SERVICE CLASS ORIGIN + HIGH ABILITY. Either being of service class origin or of high ability is sufficient for the outcome (since both condition sets, considered separately, are subsets of the outcome set). Greater coverage of the outcome is achieved by having both of these factors in the analysis rather than either alone. Figure 4: Perfect Sufficiency (two conditions) QCA: Multiple conditions and the partitioning of coverage: II We can also see here how coverage can be partitioned straightforwardly in the case of crisp sets. 
In the case of the relations illustrated in Figure 4 (previous slide) it is easy to see that the total coverage can be broken into three components: – That due to being of service class origin while not being of high ability (the yellow subset as a proportion of the blue outcome set) – That due to being of high ability while not being of service class origin (the orange subset as a proportion of the blue outcome set) – That due to being of service class origin and being of high ability (the red subset as a proportion of the blue outcome set). If we take service class origin as an example, Ragin (2006a) would describe the first of these three (the yellow subset as a proportion of the outcome set) as the unique coverage due to being from this social class background. On the other hand, the coverage due to being of this class origin, whether or not this is conjoined with other causal conditions in the model (the yellow and red subsets taken together as a proportion of the outcome set), he would describe as the raw coverage due to membership in this set (being of service class origin). Parallel arguments apply to being of high ability. From this point on we employ real large n data in illustrating QCA in use. We can use data from the National Child Development Study (NCDS), comprising children born in one week in March 1958, to illustrate a multifactor conjunctural explanation[1]. Of course, we will not expect to find perfect sufficiency in the empirical world and our example will show how the method embodied in the software addresses this problem. We explore the relations between highest qualifications achieved by age 33 and a number of factors which might be seen as either causal or as summarising possible causes of achievement. To begin with we will take, as our outcome measure, having a highest level of qualification of at least ‘A’ level or its equivalent (HQUAL_ADVANCED). 
We wish to capture something more, when referring to social class origin, than one point in time, and so, for illustrative purposes, we will take father’s[2] social class at two points. We also include a measure of mother’s education and sex of the respondent. We will not include any measure of ability in this first example, in order to keep things simpler.

[1] We will begin by using a subset of the data containing 3826 cases chosen to include no missing values on four measures of father’s class at different times and on mother’s education as well as other key variables.
[2] We use father’s class because there are many more cases of missing/not-applicable data for mother’s class. However, we include a maternal influence via mother’s education.

An illustrative Boolean analysis.

We will address the Boolean equation:

HQUAL_ADVANCED = function(MALE, PMT_FATHER_AT_BIRTH[1], PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED)

where:
HQUAL_ADVANCED refers to having qualifications of at least ‘A’ level standard by age 33.
MOTHER_POST_16_EDUCATED refers to the mother having stayed on in education after age 16.
MALE refers to being male rather than female.
PMT_FATHER_AT_BIRTH refers to the mother’s husband being in a professional, managerial or technical position[2] at the time of the respondent’s birth.
PMT_FATHER_AT_AGE_11 refers to the respondent’s father being in a professional, managerial or technical position when the respondent was aged 11.

We should stress that we are not claiming that we have anything like a properly specified model of educational achievement here. Our purpose here is to illustrate QCA in use with large n data.

[1] This is actually a measure of the mother’s husband in 1958, but to avoid unnecessary complexity (and given that this is usually the respondent’s father) we have used this description.
[2] The PMT grouping used here comprises Classes I and II of the contemporary Registrar General’s scheme.
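The crisp-set measures used in the analyses that follow (consistency with sufficiency, and raw and unique coverage) reduce to simple set arithmetic. The following is a minimal sketch only: the case identifiers and set memberships are invented for illustration and are not NCDS data, and the function names are hypothetical labels rather than anything provided by the fs/QCA software.

```python
def consistency(condition, outcome):
    """Proportion of the condition set that also lies in the outcome set."""
    return len(condition & outcome) / len(condition)


def coverage(condition, outcome):
    """Proportion of the outcome set that is overlapped by the condition set."""
    return len(condition & outcome) / len(outcome)


def unique_coverage(condition, others, outcome):
    """Coverage due to this condition alone, excluding cases that are
    also members of any of the other condition sets."""
    other_cases = set().union(*others) if others else set()
    return len((condition - other_cases) & outcome) / len(outcome)


# Invented illustration: cases are identified by arbitrary numbers.
degree = {1, 2, 3, 4, 5, 6}          # outcome set
service_class = {1, 2, 3, 7}         # case 7 has the condition but not the outcome
high_ability = {3, 4, 5}

print(consistency(service_class, degree))                      # quasi-sufficiency
print(coverage(service_class, degree))                         # raw coverage
print(unique_coverage(service_class, [high_ability], degree))  # unique coverage
print(coverage(service_class | high_ability, degree))          # solution coverage
```

In this invented example the service class set is only quasi-sufficient (one of its four cases lacks the outcome), while the high ability set is perfectly sufficient; the case that belongs to both conditions counts towards each raw coverage but towards neither unique coverage.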
Table 1: Proportions achieving HQUAL_ADVANCED by class origin, sex and mother’s education (NCDS data; n=3826): a crosstabulation

                                                       MOTHER’S EDUCATION AFTER 16
                                                       No               Yes
SEX      PMT FATHER AT BIRTH   PMT FATHER AT AGE 11    Mean    Count    Mean    Count
FEMALE   No                    No                      .25     1193     .56     184
FEMALE   No                    Yes                     .44     158      .67     61
FEMALE   Yes                   No                      .37     51       .77     35
FEMALE   Yes                   Yes                     .60     120      .78     168
MALE     No                    No                      .40     1138     .57     164
MALE     No                    Yes                     .58     132      .78     64
MALE     Yes                   No                      .66     53       .82     34
MALE     Yes                   Yes                     .71     111      .79     160

QCA: Moving from the crosstab via a truth table to a Boolean solution

The first step required is to reconfigure this as a truth table (next slide) where a “1” is entered to indicate the presence of a condition and a “0” to indicate its absence. In this table, where the rows are ordered by the measure of consistency with sufficiency, the first row (1101), for example, represents the causal configuration:

MALE*PMT_FATHER_AT_BIRTH*pmt_father_at_age_11*MOTHER_POST_16_EDUCATED

with the upper case letters indicating membership in a set and lower case letters non-membership. The proportion of the 34 cases in this configuration who achieve the outcome, i.e. 0.824, appears in the consistency column.

The second step is to determine a threshold for quasi-sufficiency and, in the light of this decision, to enter a “1” into the empty outcome (HQUAL_ADVANCED) column against each row (or causal configuration) for which the consistency proportion in the final column passes the threshold set. This decision determines which configurations are allowed into the final solution.
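The two steps just described can be sketched in a few lines. This is an illustration under stated assumptions rather than a description of what fs/QCA does internally: the proportions are the two-decimal figures from Table 1 (so tied values may sort in a different order from the three-decimal truth table), the function name `truth_table` is hypothetical, and the threshold of "more than 0.60" is one illustrative choice among many.

```python
# Cells of Table 1, keyed by (MALE, PMT_FATHER_AT_BIRTH, PMT_FATHER_AT_AGE_11,
# MOTHER_POST_16_EDUCATED). Values are (proportion achieving the outcome, count),
# using the two-decimal proportions printed in Table 1.
TABLE_1 = {
    (0, 0, 0, 0): (0.25, 1193), (0, 0, 0, 1): (0.56, 184),
    (0, 0, 1, 0): (0.44, 158),  (0, 0, 1, 1): (0.67, 61),
    (0, 1, 0, 0): (0.37, 51),   (0, 1, 0, 1): (0.77, 35),
    (0, 1, 1, 0): (0.60, 120),  (0, 1, 1, 1): (0.78, 168),
    (1, 0, 0, 0): (0.40, 1138), (1, 0, 0, 1): (0.57, 164),
    (1, 0, 1, 0): (0.58, 132),  (1, 0, 1, 1): (0.78, 64),
    (1, 1, 0, 0): (0.66, 53),   (1, 1, 0, 1): (0.82, 34),
    (1, 1, 1, 0): (0.71, 111),  (1, 1, 1, 1): (0.79, 160),
}


def truth_table(cells, threshold):
    """Step 1: one row per configuration, ordered by consistency.
    Step 2: code the outcome 1 where consistency exceeds the threshold."""
    rows = [
        {
            "config": "".join(str(bit) for bit in config),
            "number": n,
            "consistency": proportion,
            "outcome": int(proportion > threshold),
        }
        for config, (proportion, n) in cells.items()
    ]
    rows.sort(key=lambda row: row["consistency"], reverse=True)
    return rows


rows = truth_table(TABLE_1, threshold=0.60)
for row in rows:
    print(row["config"], row["number"], row["outcome"], row["consistency"])
```

Changing `threshold` and rerunning shows directly how the choice of cut-off alters which configurations are carried forward.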
Table 2: Truth table for achieving HQUAL_ADVANCED (NCDS data, n=3826)

MALE   PMT_FATHER_   PMT_FATHER_   MOTHER_POST_   number   HQUAL_     Consistency
       AT_BIRTH      AT_AGE_11     16_EDUCATED             ADVANCED
1      1             0             1              34       1          0.824
1      1             1             1              160      1          0.794
1      0             1             1              64       1          0.781
0      1             1             1              168      1          0.780
0      1             0             1              35       1          0.771
1      1             1             0              111      1          0.712
0      0             1             1              61       1          0.672
1      1             0             0              53       1          0.660
0      1             1             0              120      0          0.600
1      0             1             0              132      0          0.576
1      0             0             1              164      0          0.573
0      0             0             1              184      0          0.560
0      0             1             0              158      0          0.437
1      0             0             0              1138     0          0.397
0      1             0             0              51       0          0.373
0      0             0             0              1193     0          0.247

Three types of cases?

The decision re a threshold also effectively determines which cases, seen as captured by configurations of conditions, will be grouped together in the final solution. In this illustration we will assume that there are three levels of outcome that we wish to understand in configurational terms:
– Those configurations – or sets of cases – in which more than 60% of the cases achieve the outcome. Passing this consistency level might be argued to be consistent with this level of outcome approaching being more or less the norm for these configurations. These configurations are also those we might want to allow forward into a solution for quasi-sufficiency.
– Those configurations (sets of cases) in which fewer than 40% of the cases achieve the outcome. This level might be seen as making not achieving this level of outcome more or less the norm for these configurations.
– The remaining configurations (sets of cases) in which 40% - 60% of the cases achieve the outcome. In these configurations neither achieving nor not achieving the outcome is the norm.

Clearly, these decisions require judgements to be made. The reader will see that it is easy to explore other analyses based on other boundaries.

The first group of cases.

Let us turn to the first group. These configurations have been picked out by entering 1s and 0s in Table 2 in the HQUAL_ADVANCED column.
Table 3a (next slide) shows the solution that results when fs/QCA is asked to minimise the configurations picked out by these 1s. These eight rows (‘causal configurations’) are subjected to an algebraic process of Boolean minimisation[1] (Quine, 1952; Ragin, 1987) in order to create the final simplest solution:

MALE*PMT_FATHER_AT_BIRTH +
PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED +
PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED

The two final expressions pick out cases whose mothers had stayed on after 16 and who had a father figure in the PMT class at one of the two points in their childhood. Both males and females are included in these expressions. The first expression picks out just males who were born into a family setting with a father in the PMT class at birth.

[1] This proceeds as follows. Taking the first two rows as an example, we have 1101 and 1111. Clearly, at the level of quasi-sufficiency we have chosen, the presence or absence of the third element makes no difference. We can therefore replace it with a dash to indicate this, giving 11-1. A similar argument can be applied to the fourth and fifth rows (0111 and 0101) to give 01-1. Taking 11-1 and 01-1 together, and continuing the process, we arrive at -1-1. This is PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED, one of the terms in our final solution.
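The pairwise merging described in the footnote can be sketched as follows. This is a minimal illustration of the prime-implicant step of Quine-style minimisation, not the fs/QCA implementation: it repeatedly merges rows that differ in a single position and stops when nothing more can be merged. It omits the further step of selecting a minimal subset of prime implicants, which is not needed here because all three that result are required to cover the eight rows. The function names are hypothetical.

```python
from itertools import combinations

CONDITIONS = ["MALE", "PMT_FATHER_AT_BIRTH", "PMT_FATHER_AT_AGE_11",
              "MOTHER_POST_16_EDUCATED"]

# The eight configurations coded 1 on the outcome (consistency above 0.60).
ROWS = ["1101", "1111", "1011", "0111", "0101", "1110", "0011", "1100"]


def merge(a, b):
    """Merge two terms that differ in exactly one position (with no dash
    involved at that position); return None if they cannot be merged."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None


def prime_implicants(rows):
    """Repeat the pairwise merging until no term can be simplified further."""
    current, primes = set(rows), set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used   # terms that could not be merged any further
        current = merged
    return primes


def to_expression(term):
    """Upper case for presence, lower case for absence, dashes dropped."""
    parts = []
    for name, bit in zip(CONDITIONS, term):
        if bit == "1":
            parts.append(name)
        elif bit == "0":
            parts.append(name.lower())
    return "*".join(parts)


solution = prime_implicants(ROWS)
print(" + ".join(sorted(to_expression(t) for t in solution)))
```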
Table 3a: Solution for those for whom achieving at least ‘A’ level qualifications is more or less the norm (> 0.60 do so in each row allowed forward)

--- TRUTH TABLE SOLUTION ---
                                                raw        unique
                                                coverage   coverage   consistency
                                                --------   --------   -----------
MALE*PMT_FATHER_AT_BIRTH+                       0.158      0.067      0.751
PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED+    0.184      0.016      0.788
PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED    0.206      0.054      0.770
solution coverage: 0.305
solution consistency: 0.755

Table 3b: Solution for those for whom not achieving at least ‘A’ level qualifications is more or less the norm (< 0.40 do so in each row allowed forward)

--- TRUTH TABLE SOLUTION ---
                                                                        raw        unique
                                                                        coverage   coverage   consistency
                                                                        --------   --------   -----------
pmt_father_at_birth*pmt_father_at_age_11*mother_post_16_educated+       0.440      0.266      0.320
male*pmt_father_at_age_11*mother_post_16_educated                       0.185      0.011      0.252
solution coverage: 0.451
solution consistency: 0.322

Table 3c: Solution for those for whom neither achieving nor not achieving at least ‘A’ level qualifications is the norm (between 0.40 and 0.60 do so in each row allowed forward)

--- TRUTH TABLE SOLUTION ---
                                                                                  raw        unique
                                                                                  coverage   coverage   consistency
                                                                                  --------   --------   -----------
pmt_father_at_birth*pmt_father_at_age_11*MOTHER_POST_16_EDUCATED+                 0.116      0.116      0.566
MALE*pmt_father_at_birth*PMT_FATHER_AT_AGE_11*mother_post_16_educated+            0.045      0.045      0.576
male*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11*mother_post_16_educated             0.042      0.042      0.600
solution coverage: 0.203
solution consistency: 0.575

QCA: an example of a quasi-necessary condition: I

It might be thought, at least for some hypothesised meritocracy, that were academic ability to be appropriately defined and measured then some minimum level of this factor ought to be a necessary condition for anyone to achieve a degree. Table 4a illustrates this, where one cell should be empty if the chosen level of ability (X) is a strictly necessary condition for a degree to be achieved.
Here, we might be seen as assuming causal homogeneity for the factor of ability.

Table 4a: Strict necessity of some level of ability (X) for achieving a degree

             Achieves degree
Ability      No                Yes
< X          Cases possible    Empty
≥ X          Cases possible    Cases possible

QCA: an example of a quasi-necessary condition: II

An examination by eye of the NCDS distribution of the proportions achieving a degree at each point of the ability scale allows us to estimate what such a level of ability might be empirically, for all respondents taken together. It is, in fact, around the mean ability score and if we create a factor setting ability as either over or under the mean score for our subset of 3826, we obtain Table 4b, showing that the proportion of those obtaining a degree whose ability score is below the mean is only 10.4%. Especially given that this proportion may include cases where the measurement was low through either error or chance factors, we might be willing to say that a score above the mean approaches being a necessary condition for achieving a degree in this sample and is therefore a quasi-necessary condition.

Table 4b: Achieving a degree by ability below and above the mean (column %)[1]

             Achieves degree
Ability      No               Yes
< Mean       1748 (52.7%)     53 (10.4%)
> Mean       1568 (47.3%)     457 (89.6%)

[1] As it happens this test only has discrete scores, from 0 to 80. The mean lies between two of these scores.

QCA: an example of a quasi-necessary condition: III

However, we cannot be satisfied with this conclusion which, as we said, effectively assumes causal homogeneity, with ability operating in the same way across all types of cases and, of course, leaves us wondering about the features of the cases amongst the 10.4%.
We obviously want to know whether there are sets of cases – perhaps, for example, differentiated by social class – for whom being either above or below the mean, when conjoined with other factors, is either necessary and/or sufficient or not for achieving a degree (or quasi-necessary or quasi-sufficient), especially as apparent returns to ability vary by class, as Figure 5 (next slide), produced using a slightly different class origin categorisation, clearly shows.

Figure 5: Proportions gaining a degree by ability at age 11 and social class (subset of NCDS, n=3826)
[Chart: proportion gaining a degree (vertical axis, 0.000 to 0.700) against ability score at age 11 (horizontal axis, in bands <=17, 18-24, 25-31, 32-38, 39-45, 46-52, 53-59, 60-66, >66), with separate series for service, intermediate and working class origins.]

QCA: an example of a quasi-necessary condition: IV
• To explore these questions, we might undertake an analysis that includes a measure of ability being over the mean, given what we found in Table 4b. Let us undertake an analysis of: HQUAL_DEGREE = function(ABILITY_ABOVE_MEAN, MALE, PMT_FATHER_AT_BIRTH, PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED).
• The relevant truth table is shown in Table 5 (next slide), with the rows ordered by consistency. We can see that the first five rows have a consistency level of 0.40 or above, which we might label as implying that for these cases, gaining a degree is, all else being equal, a definite possibility, something that is a pretty common occurrence in their milieus. Each of these configurations is characterised by having ability above the mean, but conjoined with several supportive paternal and maternal ascriptive factors, and, in most cases, with male sex. The minimised solution for these rows is shown in Table 6 (two slides on) where ABILITY_ABOVE_MEAN appears, as a necessary condition should, in each expression.
• We will return to the somewhat paradoxical threshold-dependent sense which the term “necessary” has in this claim after a subsequent example.
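The quasi-necessity reading of Table 4b comes down to two proportions, which can be set side by side as a small worked example. The cell counts are those of Table 4b; the function names are hypothetical, and treating necessity as "the share of outcome cases that show the condition" follows the earlier remark that full coverage is the arithmetic mark of a necessary condition in the simple crisp-set case.

```python
# Cell counts from Table 4b (ability relative to the mean, by degree).
DEGREE_BELOW_MEAN = 53
DEGREE_ABOVE_MEAN = 457
NO_DEGREE_ABOVE_MEAN = 1568


def necessity_consistency(outcome_with_condition, outcome_without_condition):
    """Share of the outcome set that lies inside the condition set."""
    total_outcome = outcome_with_condition + outcome_without_condition
    return outcome_with_condition / total_outcome


def sufficiency_consistency(condition_with_outcome, condition_without_outcome):
    """Share of the condition set that lies inside the outcome set."""
    total_condition = condition_with_outcome + condition_without_outcome
    return condition_with_outcome / total_condition


necessity = necessity_consistency(DEGREE_ABOVE_MEAN, DEGREE_BELOW_MEAN)
sufficiency = sufficiency_consistency(DEGREE_ABOVE_MEAN, NO_DEGREE_ABOVE_MEAN)

print(f"necessity:   {necessity:.3f}")    # share of degree holders above the mean
print(f"sufficiency: {sufficiency:.3f}")  # share of above-mean cases with a degree
```

The contrast between the two figures is the point: taken alone, ability above the mean comes close to being necessary for a degree while being far from sufficient for one.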
Table 5

ABILITY_      MALE   PMT_FATHER_   PMT_FATHER_   MOTHER_POST_16   HQUAL_    number   consistency
ABOVE_MEAN           AT_BIRTH      AT_AGE_11     _EDUCATED        DEGREE
1             1      1             1             0                          68       0.485
1             1      1             1             1                          126      0.484
1             1      1             0             1                          25       0.440
1             0      1             1             1                          144      0.424
1             1      0             1             1                          48       0.417
1             0      0             1             1                          47       0.340
1             0      1             0             1                          24       0.292
1             0      1             1             0                          93       0.290
1             0      0             0             1                          127      0.283
1             1      0             1             0                          75       0.253
1             1      0             0             1                          101      0.238
0             1      1             1             1                          34       0.206
1             0      0             1             0                          99       0.192
1             1      1             0             0                          28       0.179
1             1      0             0             0                          451      0.175
0             1      0             0             1                          63       0.143
0             0      1             1             1                          24       0.125
0             1      0             1             1                          16       0.125
1             0      1             0             0                          35       0.114
0             1      1             0             0                          25       0.080
0             0      0             1             1                          14       0.071
1             0      0             0             0                          534      0.066
0             0      1             0             0                          16       0.063
0             1      0             1             0                          57       0.053
0             0      0             1             0                          59       0.034
0             1      1             1             0                          43       0.023
0             1      0             0             0                          687      0.019
0             0      0             0             1                          57       0.018
0             0      0             0             0                          659      0.012
0             0      1             1             0                          27       0.000
0             0      1             0             1                          11       0.000
0             1      1             0             1                          9        0.000

Table 6: Minimised solution for Table 5, for first five rows

--- TRUTH TABLE SOLUTION ---
frequency cutoff: 9.000
consistency cutoff: 0.417
                                                                                       raw        unique
                                                                                       coverage   coverage   consistency
                                                                                       --------   --------   -----------
ABILITY_ABOVE_MEAN*MALE*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11+                      0.184      0.065      0.485
ABILITY_ABOVE_MEAN*MALE*PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED+                   0.141      0.022      0.477
ABILITY_ABOVE_MEAN*MALE*PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED+                  0.159      0.039      0.466
ABILITY_ABOVE_MEAN*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED    0.239      0.120      0.452
solution coverage: 0.365
solution consistency: 0.453

QCA: an example of a quasi-necessary condition: V

A further inspection of Table 5 shows, as we might expect, that having this level of ability characterises the top half of the ordered table (14 out of the 16 rows). However, there are exceptions. The first, in the twelfth row, is the configuration, with only 34 cases:

ability_above_mean*MALE*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED

This conjunction of lower ability with supportive ascriptive factors is associated with some 20.6% achieving a degree, some way above the mean of 13.3%.
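The raw coverage, unique coverage and consistency figures reported in Table 6 can be approximately recomputed from the five truth table rows that enter the solution. The sketch below rests on two assumptions that are not stated in the slides: the number of degree holders in each row is recovered by rounding number × consistency, and the bit order of each configuration is ABILITY_ABOVE_MEAN, MALE, PMT_FATHER_AT_BIRTH, PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED. The total of 510 degree holders is taken from Table 4b. Function names are hypothetical.

```python
# Rows of Table 5 with consistency of 0.40 or above: config -> (number, consistency).
ROWS = {
    "11110": (68, 0.485),
    "11111": (126, 0.484),
    "11101": (25, 0.440),
    "10111": (144, 0.424),
    "11011": (48, 0.417),
}
TOTAL_WITH_DEGREE = 53 + 457   # all degree holders, from Table 4b

# The four terms of Table 6, with a dash marking an eliminated condition.
TERMS = ["1111-", "111-1", "11-11", "1-111"]


def matches(term, config):
    """True if the configuration is covered by the (possibly dashed) term."""
    return all(t in ("-", c) for t, c in zip(term, config))


def with_degree(config):
    """Assumption: degree holders in a row, recovered by rounding n * consistency."""
    number, consistency = ROWS[config]
    return round(number * consistency)


def term_stats(term):
    covered = [c for c in ROWS if matches(term, c)]
    others = [t for t in TERMS if t != term]
    only_here = [c for c in covered if not any(matches(t, c) for t in others)]
    hits = sum(with_degree(c) for c in covered)
    return {
        "raw": hits / TOTAL_WITH_DEGREE,
        "unique": sum(with_degree(c) for c in only_here) / TOTAL_WITH_DEGREE,
        "consistency": hits / sum(ROWS[c][0] for c in covered),
    }


for term in TERMS:
    stats = term_stats(term)
    print(term, {name: round(value, 3) for name, value in stats.items()})

solution_hits = sum(with_degree(c) for c in ROWS)
solution_coverage = solution_hits / TOTAL_WITH_DEGREE
solution_consistency = solution_hits / sum(n for n, _ in ROWS.values())
print(round(solution_coverage, 3), round(solution_consistency, 3))
```

Because the row 11111 is covered by all four terms, it contributes to every raw coverage but to no unique coverage, which is why the unique figures are so much smaller than the raw ones.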
QCA: an example of a quasi-necessary condition: VI

We might be especially interested in exploring what it is about those with lower than mean ability that might explain their achieving proportionally more degrees than expected. It is likely, as we can see from this example, to be the presence of supporting ascriptive factors. However, the numbers become very small in some of the relevant rows in Table 5. For this reason, we will explore this question using a different boundary within the ability scale. Sixty-one percent of those achieving degrees in the 3826 have ability in the top 20% of the overall distribution in the NCDS (see Table 7). We can use the remaining 39% to explore what factors, conjoined with being outside the top 20%, are associated with raising the proportion gaining a degree. We will define, for current purposes, ability in the top 20% as “high ability”.

Table 7: Degrees by High Ability (i.e. ability in top 20%) (column %)

                        No Degree       Has Degree
Not of High Ability     2693 (81.2%)    199 (39.0%)
Of High Ability         623 (18.8%)     311 (61.0%)

QCA: an example of a quasi-necessary condition: VII

Therefore let us undertake a Boolean analysis parallel to the earlier one but that excludes the top 20% of the ability range. Table 8 (next slide) is the relevant truth table, ordered by consistency. A glance at this shows that, for these cases, mother’s education is a key factor in raising the likelihood of a degree. If we set a 0.20 threshold to explore this (having noted the jump from 0.16 to 0.20 in the consistency column), we obtain the solution in Table 9 (two slides on). Within the confines of this analysis, i.e. for those not of high ability as defined, MOTHER_POST_16_EDUCATED is necessary to raise the proportion obtaining a degree to 20%, as is also a father’s class position in the PMT classes for at least one of the two points included. However, the low coverage figure for the solution should be noted (0.296).
Amongst those not of high ability as defined, more degrees (140) are gained by individuals outside of the configurations included in this solution than by those within them (59). It must therefore be stressed that the sense of “necessary” here is that of being necessary to raise the proportion for a configuration to 0.2 or better, and not the sense that it is not possible for an individual to gain a degree without a suitably educated mother. Many do precisely the latter.

Table 8: Degree by sex, class and mother’s education (only for those whose ability is outside the top 20%)

MALE   PMT_FATHER_   PMT_FATHER_   MOTHER_POST_   number   HQUAL_    consistency
       AT_BIRTH      AT_AGE_11     16_EDUCATED             DEGREE
1      1             1             1              88                 0.261
1      0             1             1              39                 0.256
0      0             1             1              29                 0.241
0      1             1             1              74                 0.203
0      1             0             1              20                 0.200
1      0             0             1              109                0.156
1      1             0             0              42                 0.143
1      1             1             0              72                 0.139
0      0             0             1              125                0.120
1      1             0             1              17                 0.118
0      1             1             0              73                 0.110
1      0             1             0              93                 0.097
0      0             1             0              120                0.083
0      1             0             0              41                 0.073
1      0             0             0              963                0.047
0      0             0             0              987                0.015

Table 9: Degree by sex, class and mother’s education (only for those whose ability is outside the top 20%)

--- TRUTH TABLE SOLUTION ---
frequency cutoff: 17.000
consistency cutoff: 0.200
                                                     raw        unique
                                                     coverage   coverage   consistency
                                                     --------   --------   -----------
PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED+        0.276      0.201      0.239
male*PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED     0.095      0.020      0.202
solution coverage: 0.296
solution consistency: 0.236

QCA: Limited Diversity in Datasets and Counterfactual Reasoning

In the examples we have used above, and with the number of conditions employed in those models, we did not experience the problem of very small numbers in some rows of the truth table that can arise with more conditions as a consequence of (i) the exponential increase in the number of rows as more conditions are included and (ii) the relations – or correlations – between conditions in the empirical world (Ragin & Sonnett, 2005).
Small numbers of cases in some configurations constitute a problem because it is difficult to make a valid statement about a group of cases who, empirically, only appear in small numbers. In regression analyses, since the weight of the various combinations of scores on variables is taken into account in calculating average net effects, this problem is effectively dealt with mechanically, partly via the use of significance tests. Ragin has suggested a range of ways of using counterfactual reasoning to address the problems caused by limited diversity. For our use of these approaches with the NCDS data, which we will not have time to discuss, see Cooper & Glaesser (2008).

QCA: Some Problems in its Use With Large Datasets

We will introduce here some of the problems and issues that arise for us in using QCA with large n data. We will begin with problems that are not peculiar to QCA since they parallel the correlation / causation problem in conventional quantitative analyses. We will then discuss some problems that are more QCA-specific, though, to some extent, it must be remembered, these may be a consequence of its relatively recent development. Unlike regression, QCA has not been under development for more than a century!

Although we may, and certainly should, have inserted some ‘cautious’ words (‘potentially’, ‘possible’, etc.) before the word causal at various places in this talk, we have not yet addressed the question of whether QCA, as an analytic tool, is able to avoid analogous problems to those associated with moving from correlations to causal claims in the regression approach. Clearly, we might enter into a Boolean model a ‘condition’ that we then found to be logically necessary, for example, for some outcome, but which we would not want to regard as truly causal. Two types of such conditions are worth distinguishing.

QCA: non-causal conditions I

Alcohol might be a necessary (and causal) condition for drunkenness, but, in a society in which it was always mixed with tonic water, we would want to be able to reject a claim (which QCA could obviously deliver, if used mechanically) that tonic water was a necessary causal condition for drunkenness. We would do this, presumably, by reference to existing theoretical knowledge, preferably of the mechanisms and processes involved in the production of drunkenness and/or by comparisons with other sets of findings where tonic water was not mixed with alcohol, etc[1].

[1] Cartwright (2007) provides a formal treatment of this correlation/causation problem in the context of QCA.

QCA: non-causal conditions II

To avoid problems of infinite regress, we would want to be able to distinguish some types of causal necessary conditions from others. It may well be necessary for oxygen to be present in order for degrees to be achieved, but we wouldn’t normally expect to address this in an analysis of educational achievement. Mackie’s (1974) concept of the “causal field” provides a way of addressing this potential problem. This field acts as a background context which absorbs the causal factors we would not expect to see referred to as part of an explanation of some particular outcome under examination.

QCA: non-causal conditions III

Having noted these problems, we would nevertheless want to argue that, in our earlier analyses, there are plausible mechanisms implied by such summarising conditions as social class. These conditions (class, ability, etc.) or, at least, the more specific factors they summarise, are plausible causal factors. Furthermore, when addressing some evaluative questions (e.g. is Britain a meritocracy?), the question itself, once its constituent terms are defined, usually points to the relevant factors to include in a configurational analysis (Cooper, 2005, 2006).

QCA: Underdetermination of theory by data, etc.

We might find in some population that being in the set male*WORKING_CLASS is perfectly sufficient for NOT achieving a given level of educational qualification. However, whether this is due to working class females lacking some capacity or disposition required to cope with the appropriate curriculum or whether, on the other hand, some form of educational apartheid ensures that no working class female is allowed to enter the institution offering the curriculum, clearly cannot be read off from the Boolean expression. Of course, other Boolean models perhaps could be used to provide part of the answer (exploring what happens to other females, to working class males; including dispositional factors) but, ideally, we need knowledge of the processes and mechanisms that generate the observed outcomes. Nothing in Ragin’s work, we should note, suggests that he thinks otherwise.

QCA: problems to do with randomness

We might find that the configuration HIGH_ABILITY * SERVICE_CLASS has a consistency with sufficiency of, say, 0.90, for achieving some outcome, thereby reaching a level that Ragin would regard as indicating quasi-sufficiency. However, is this gap between 1.00 and 0.90 to be explained by our having the equivalent of an underspecified model in a regression analysis (e.g. perhaps some missing ascriptive factors or a lack of factors concerning ‘choice’) or by the existence of stochastic elements in the social world (and/or measurement or sampling error)? In the former case, there exists some causal heterogeneity yet to be picked out by the conditions entered in the model. It might be that HIGH_ABILITY * SERVICE_CLASS * MALE has perfect consistency with sufficiency, for example. This would leave us, however, with HIGH_ABILITY * SERVICE_CLASS * male having a lower consistency than 0.90 and return us to the same question again, but this time just for females.

QCA and counterfactualist perspectives of causation

A counterfactualist perspective on causation (e.g.
Morgan & Winship, 2007) could be used to raise questions about some QCA-derived claims re causality in the same way it raises questions about some regression-based forms of analysis that basically use a branch of mathematics to describe relations in datasets[1]. On the other hand, a move from a net effects perspective (one assuming independently manipulable independent variables) to one emphasising conjunctural causation might be expected to make it less likely that unjustified counterfactual claims are made by policy makers on the basis of research findings, especially about the effects of intervening to change a single factor without taking account of its context.

[1] For a relevant and interesting exchange of views, see Ragin & Rihoux, 2004a,b; Lieberson, 2004; Seawright, 2004; Mahoney, 2004.

More QCA-specific issues: inference from samples to populations I

The first point concerns work that uses samples from some population. This is usually the situation we find ourselves in when working with large datasets. Although attempts have been made (e.g. in an earlier version of the fs/QCA software) to incorporate significance testing (see also Ragin, 2000, and Smithson and Verkuilen, 2006), this is an area requiring more work. Especially when numbers become small in some rows of a truth table, and especially when survey data are being used, a critic will always be able to ask whether sampling (or measurement) error has been taken into account. Although we have considerable sympathy with the view that judgement should play a role in these situations – especially as significance tests are frequently employed when the conditions for their use are not met – we also recognise that more work on incorporating significance testing into QCA would be useful, simply because chance always offers a potential threat to any analytic claim we might make. But note that Ragin (1987, 2000) has a different perspective on ‘populations’ to the one implied here.
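To illustrate the critic's question, one rough way to gauge how much sampling error could lie behind a row's consistency is an interval for a single proportion. The Python sketch below uses the Wilson score interval. This is our illustrative choice, not a procedure taken from the fs/QCA software or from Ragin. It assumes simple random sampling within the row, which survey data such as the NCDS may not satisfy. The function name wilson_interval is hypothetical, and the counts are taken from two rows of Table 8 (20 cases at 0.200, i.e. 4 degrees; 88 cases at 0.261, i.e. 23 degrees).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for roughly 95%).

    Assumption: the n cases in the row are a simple random sample.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Row with 20 cases and consistency 0.200 (sits exactly on the 0.2 cutoff).
low_small, high_small = wilson_interval(4, 20)
# Row with 88 cases and consistency 0.261.
low_large, high_large = wilson_interval(23, 88)

print(round(low_small, 3), round(high_small, 3))
print(round(low_large, 3), round(high_large, 3))
```

For the 20-case row the interval is wide and extends well below the 0.2 consistency cutoff, which is the kind of doubt a critic could raise. The interval for the 88-case row is narrower. Such a check supplements, rather than replaces, the judgement discussed above.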

More QCA-specific issues: inference from samples to populations II

A related problem we have ignored during the talk so far is that of missing data. Can we assume that the Boolean solutions we have presented, often based on smallish subsets of the whole NCDS (because of the missing data problem), would hold for the NCDS as a whole? This would seem unlikely unless the missing data have been generated by random rather than systematic processes. Of course, it is possible to undertake some simple checks to see whether any bias is likely to have been introduced. It is also possible to use sophisticated techniques (multiple imputation, etc.) to replace missing data, but such approaches require considerable faith in the very linear models that Ragin and others have argued are often unhelpful in the social world. This is a difficult problem to which we intend to give further thought.

More QCA-specific issues: case knowledge (or its lack) in large n contexts

We lack, in the traditional sense, the detailed case knowledge that Ragin argues is required to undertake QCA. The NCDS, in one sense, does contain a mass of data on each individual respondent but, for example, (i) it is collected via techniques that are likely to generate considerable error and (ii) it is not possible for us to return to the respondent to correct likely errors or to seek new data from earlier periods as analyses develop.

More QCA-specific issues: quasi as opposed to perfect necessity and sufficiency

Repeating what we said earlier, there is the question of whether and when it makes sense to ever stop at quasi-levels of consistency, i.e. to ignore the deviant cases in a row (or to allow a ceteris paribus clause). More generally, the use of weak implication (quasi-sufficiency and quasi-necessity as opposed to sufficiency and necessity) deserves more discussion (but see Abell, 1971, and also Goertz, 2005; Waldner, 2005; Sekhon, 2005 for a recent exchange).

We’ve raised a lot of problems here, though we ourselves believe QCA to be a very important addition to the armoury of the social scientist interested in exploring potentially causal relations. The fuzzy set variety of QCA allows the conjunctural perspective to be brought to bear more finely than the crisp set version we have discussed here, but, inevitably, given the nature of fuzzy sets and logic, brings along some additional problems (many addressed in Ragin’s own account in Fuzzy Set Social Science). We are looking forward to further developments of these methods and, in particular, to Ragin’s forthcoming new book Redesigning Social Inquiry: Fuzzy Sets and Beyond.

References

Abell, P. (1971) Model Building in Sociology. London: Weidenfeld & Nicolson.
Abbott, A. (2001) Time Matters. London & Chicago: Chicago University Press.
Boudon, R. (1974a) The logic of sociological explanation. Harmondsworth: Penguin.
Boudon, R. (1974b) Education, Opportunity and Social Inequality. NY: Wiley-Interscience.
Byrne, D. (1998) Complexity Theory and the Social Sciences. London: Routledge.
Byrne, D. (2002) Interpreting Quantitative Data. London: Sage.
Cartwright, N. (2007) Hunting Causes and Using Them: Approaches in Philosophy and Economics. Cambridge: Cambridge University Press.
Cooper, B. (2005) Applying Ragin’s crisp and fuzzy set QCA to large datasets: social class and educational achievement in the National Child Development Study. Sociological Research Online. 10, 2 <http://www.socresonline.org.uk/10/2/cooper.html>
Cooper, B. (2006) Using Ragin’s Qualitative Comparative Analysis with longitudinal datasets to explore the degree of meritocracy characterising educational achievement in Britain. Paper presented to the Sociology of Education SIG at the Annual Meeting of the American Educational Research Association, San Francisco.
Cooper, B. & Glaesser, J. (2007) Exploring Social Class Compositional Effects on Educational Achievement with Fuzzy Set Methods: A British Study. Paper presented to the Sociology of Education SIG at the Annual Meeting of the American Educational Research Association, Chicago.
Cooper, B. & Glaesser, J. (2008) Exploring alternatives to the regression analysis of quantitative survey data in education: what does the configurational approach have to offer? Paper presented at the Annual Meeting of the American Educational Research Association, New York.
Cooper, B. & Glaesser, J. (in press) How has educational expansion changed the necessary and sufficient conditions for achieving professional, managerial and technical class positions in Britain? A configurational analysis. Sociological Research Online.
Freedman, D.A. (1987) As others see us: a case study in path analysis. Journal of Educational Statistics. 12, 2, 101-128.
Freedman, D.A. (1997) From association to causation via regression. In McKim, V.R. & Turner, S.P. (Eds) Causality in Crisis? Statistical Methods and the Search for Causal knowledge in the Social Sciences. Notre Dame, Indiana: University of Notre Dame Press.
Glaesser, J. (forthcoming, 2009) Just how flexible is the German selective secondary school system? A configurational analysis. International Journal of Research and Method in Education.
Goertz, G. (2005) Necessary condition hypotheses as deterministic or probabilistic: does it matter? Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Spring 2005, 22-27.
Gorard, S. (2006) Towards a judgement-based statistical analysis. British Journal of Sociology of Education. 27, 1, 67-80.
Hauser, R. (1976) On Boudon’s model of social mobility. The American Journal of Sociology. 81, 4, 911-928.
Hedström, P. (2005) Dissecting the Social: On the Principles of Analytical Sociology. Cambridge: Cambridge University Press.
Lieberson, S. (1985) Making it Count: the improvement of Social Research and Theory. Berkeley: University of California Press.
Lieberson, S. (2004) Comments on the use and utility of QCA. In Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Fall 2004, Vol. 2, No. 2, 13-14.
Mackie, J. (1974) The Cement of the Universe. Oxford: Clarendon Press.
Mahoney, J. (2001) Beyond correlational analysis: recent innovations in theory and method. Sociological Forum. 16, 3, 575-593.
Mahoney, J. (2004) Reflections on fuzzy-set/QCA. In Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Fall 2004, Vol. 2, No. 2, 17-21.
Mahoney, J. & Goertz, G. (2006) A tale of two cultures: contrasting quantitative and qualitative research. Political Analysis, 14, 3, 227-249.
Morgan, S.L. & Winship, C. (2007) Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge: Cambridge University Press.
Ormerod, P. (1998) Butterfly Economics. London: Faber and Faber.
Pawson, R. & Tilley, N. (1997) Realistic Evaluation. London: Sage.
Pearl, J. (2000) Causality: models, reasoning and inference. Cambridge: Cambridge University Press.
Quine, W.V. (1952) The problem of simplifying truth functions. American Mathematical Monthly, Vol. 59, No. 8, pp. 521-531.
Ragin, C.C. (1987) The comparative method. Berkeley & Los Angeles: California University Press.
Ragin, C.C. (2000) Fuzzy set social science. Chicago: Chicago University Press.
Ragin, C.C. (2003) Recent advances in fuzzy-set methods and their application to policy questions. <http://www.compasss.org/Ragin2003.PDF>
Ragin, C.C. (2005) From fuzzy sets to crisp truth tables. <http://www.compasss.org/Raginfztt_April05.pdf>
Ragin, C.C. (2006a) Set relations in social research: evaluating their consistency and coverage. Political Analysis. 14, 291-310.
Ragin, C.C. (2006b) The limitations of net effects thinking. In Rihoux, B. & Grimm, H. (Eds) Innovative Comparative Methods for Political Analysis, NY: Springer.
Ragin, C.C. & Rihoux, B. (2004a) Qualitative Comparative Analysis (QCA): state of the art and prospects. In Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Fall 2004, Vol. 2, No. 2, 3-13.
Ragin, C.C. & Rihoux, B. (2004b) Replies to commentators: reassurances and rebuttals. In Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Fall 2004, Vol. 2, No. 2, 22-24.
Ragin, C.C. & Sonnett, J. (2005) Between complexity and parsimony: limited diversity, counterfactual cases, and comparative analysis. In Kropp, S. & Minkenberg, M. (Eds) Vergleichen in der Politikwissenschaft. Wiesbaden: VS Verlag für Sozialwissenschaften.
Ragin, C.C., Rubinson, C., Schaefer, D., Anderson, S., Williams, E. & Giesel, H. (2006) User's Guide to Fuzzy-Set/Qualitative Comparative Analysis 2.0. Tucson, Arizona: Department of Sociology, University of Arizona.
Ron, A. (2002) Regression analysis and the philosophy of social science: a critical realist view. Journal of Critical Realism. 1, 1, 119-142.
Rothman, K.J. (1976) Causes. American Journal of Epidemiology. 104, 6, 587-592.
Seawright, J. (2004) Qualitative comparative analysis vis-à-vis regression. In Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Fall 2004, Vol. 2, No. 2, 14-17.
Sekhon, J.S. (2005) Probability tests require distributions. Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Spring 2005, 29-30.
Smithson, M. & Verkuilen, J. (2006) Fuzzy Set Theory: Applications in the Social Sciences. London: Sage.
Sörensen, A. (1998) Theoretical mechanisms and social processes. In Hedström, P. & Swedberg, R. (Eds) Social Mechanisms: an analytical approach to social theory. Cambridge: Cambridge University Press.
Taagepera, R. (2005) Predictive versus postdictive models. Paper presented to the 3rd conference of the European Consortium for Political Research. Budapest, September 2005.
Waldner, D. (2005) It ain’t necessarily so – or is it? Qualitative Methods: Newsletter of the American Political Science Association Organized Section on Qualitative Methods. Spring 2005, 27-29.