BRUCE GLYMOUR

DATA AND PHENOMENA: A DISTINCTION RECONSIDERED

ABSTRACT. Bogen and Woodward (1988) advance a distinction between data and phenomena. Roughly, the former are the observations reported by experimental scientists, the latter are objective, stable features of the world to which scientists infer based on patterns in reliable data. While phenomena are explained by theories, data are not, and so the empirical basis for an inference to a theory consists in claims about phenomena. McAllister (1997) has recently offered a critique of their version of this distinction, offering in its place a version on which phenomena are theory laden, and hence on which the empirical support for inferences to theories is also, unavoidably, theory laden. In this commentary I argue that McAllister and Bogen and Woodward are mistaken in thinking that the distinction is necessary, and that the empirical support for inferences to theories is not necessarily theory laden in the way McAllister’s account entails it is.

Bogen and Woodward in their (1988) are concerned to address three problems about the nature of scientific evidence.

(1) Observations are taken to provide reason to accept theories because they provide evidence for those theories. But theories do not in general explain any particular observational datum. How then can any particular datum or set of data be evidence for a theory, and if observations cannot be evidence for theories, how can they provide reason for accepting a theory? Say that a datum or set of data counts as reason to accept or reject some theory provided it bears some particular relation to the theory, and call that relation the evidential relation. What, Bogen and Woodward ask, is the evidential relation?

(2) While some data are counted as evidence for or against theories, other data are not even potential candidates for evidential status.
Although these data appear to bear the same logical relationship to relevant theories as do other data, they are simply ignored as artifacts of the experimental design, as so much experimental noise. What property or properties of a datum or a data set, then, make it a candidate for evidential status, or, as I shall say, what are the qualifying properties a datum or data set must exhibit if it is to be a candidate for evidential status?

(3) Are the evidential relation and the qualifying properties such that scientific evidence must necessarily be theory laden, i.e. be infected by the theories employed by the scientists collecting data, or inferring from data to theories?

Erkenntnis 52: 29–37, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.

Bogen and Woodward seek to answer these questions by distinguishing between data and phenomena. The two sorts of entities are held to differ in both their ontological and epistemological status. The two differ epistemically because phenomena are, while data are not, explained by theories. Hence phenomena can be direct evidence for theories, while data, at most, can be only indirect evidence for theories. Further, knowledge of data is non-inferentially justified, while knowledge of phenomena is inferentially justified. To say that a scientist is wrong about the data she reports is necessarily to say that she did not in fact see what she claims to have seen, while to say that a scientist is wrong about the phenomena she reports need only be to say that she has drawn incorrect inferences from what she indisputably did see. The two differ ontologically in that phenomena are stable, repeatable features of the natural world, while data are not. Phenomena are ineliminable, bedrock elements of the furniture of the world.
Particular observations, however, are one-time occurrences that result from accidental collocations of, and hence causal interactions between, a veritable legion of causes and conditions.

Using the distinction, Bogen and Woodward claim that scientific inference works as follows. Data are gathered, and sorted as to their reliability. From patterns in reliable data, scientists infer to the existence of phenomena, which either are, or are not, explained by theories. When one can infer from reliable data to a phenomenon or its absence, the claim that the phenomenon exists, or does not, counts as evidence for or against relevant scientific theories. Reliability, consequently, is the qualifying property that data must have if they are to be candidates for evidential status, though Bogen and Woodward do not say, precisely, what it is for a datum or data set to be reliable. The evidential relation between theory and data, insofar as there is one, is then composed of some inferential relation between data and phenomena (data are evidence for phenomena) and another between phenomena and theory (phenomena are evidence for theories). Again, Bogen and Woodward do not say what sorts of inferential relations count here. Thus, data are, at best, indirect evidence for a theory.

Bogen and Woodward then further argue that phenomena are not simply theory-laden observations. Of course, if ‘observation’ or ‘perception’ is defined loosely enough, then phenomena will so count. But, according to Bogen and Woodward, any definition of these terms on which phenomena do so count is simply too loose to be informative. If observations are taken to be produced by distal causes which are, causally, relatively close to human perceptual experiences, i.e.
blotches on screens or graph paper, clicks of a Geiger counter, flashes on photographs, the position of pointers on meters and so on, then claims about what phenomena exist are not simply theory-laden observations. Rather they are claims justified by inferences from reliable data, i.e. reliable observations of blotches, clicks, flashes and positions. Such inferences can of course be more or less reliable, more or less cogent, quite independently of the reliability of the data on which they are based. And one way in which these inferences can go awry is that they assume one or another background theory which happens to be false. While this makes inferences to phenomenal claims theory dependent, it does not make claims about the existence of phenomena necessarily theory laden in the relevant sense, for the sense in which inferences go awry here depends on the possibility of the inferred phenomenal claims being false even though the investigator believes them to be true. The existence or non-existence of any particular phenomenon cannot, then, be determined simply by what the investigator believes about the phenomenon. Phenomena are therefore objectively real, in some strong sense of those terms, though of course we may be mistaken about the claim that any particular phenomenon does, or does not, exist.

In a recent paper McAllister (1997) claims against Bogen and Woodward that while the distinction between data and phenomena is both useful and cogent, (a) Bogen and Woodward have not managed to draw that distinction in the right way, and (b) on the correct distinction, phenomena are themselves investigator relative features of the world. His case is this. Every data set can be understood as being composed of some signal, i.e. stable pattern, and a certain level of noise. Indeed, for any given (non-zero) level of noise, any data set will exhibit an infinite number of patterns.
So any data set you please will exhibit an infinite number of patterns (see note 1). McAllister then claims that no objective facts about the empirical evidence constrain which of the infinite patterns exhibited by a data set one ought to take to be phenomena. Since phenomena are supposed to be basic ontological features of the world, they cannot be infinite in number, so not all patterns can count as phenomena. The scientist must therefore privilege some but not all of the patterns as phenomena, and the choice of which to so privilege is unconstrained by the evidence.

It then follows from McAllister’s account that the choice about which patterns to recognize as phenomena can only be made by the investigator on subjective grounds, grounds that presumably cannot but include her prior theoretical commitments. For McAllister, one important such commitment concerns the level of noise the investigator is willing to tolerate, but this choice alone will not be enough to identify the phenomena among the plethora of patterns, since for any noise level there is an infinity of patterns exhibited by the data. Thus, phenomena can be identified only against some commitment over and above one’s choice about how much noise to tolerate. Phenomena, therefore, cannot be objective features of the world: while their status qua pattern in the data is clearly objective, they are distinguished from other patterns as phenomena only by investigator relative commitments, commitments that are not themselves subject to objective criticism, since the recognition of any pattern whatsoever as a phenomenon requires that similarly non-empirical commitments be employed.

McAllister apparently thinks the distinction between data and phenomena is an essential element in any clear philosophic account of scientific practice. And on the understanding of phenomena he defends, such entities are investigator relative, rather than objective, features of the world.
Hence, on his account, the empirical support for any inference to a theory is necessarily theory laden.

I am unconvinced. While I think McAllister has recognized a serious flaw in the distinction advanced by Bogen and Woodward, and that their account simply does not work very well, I argue that no such distinction between data and phenomena is needed, and that the distinction which is needed, and is already well established in the relevant literature, does not entail the sort of relativism required by McAllister’s version of the distinction between data and phenomena.

Perhaps the place to begin is by making somewhat more precise the notion of ‘pattern’ employed both by Bogen and Woodward and by McAllister. Consider a standard, though certainly not the only, method for discovering causal relations among variables (cf. Spirtes et al. 1993). One constructs a sample of data by measuring, in observational or experimental contexts, the joint distribution of values among these variables. One treats the sample as a sample from a population of data with a particular statistical structure, given by a probability density function on joint values for the variables. This function entails certain conditional independence relations among the variables. One then performs a double inference. One first calculates various sample statistics, and infers from the values of these sample statistics to a model of the population of data. From the population model and the conditional independencies entailed by it, one infers a causal model, or class of such models, that accounts for the conditional independence relations in the population model. While not all statistical inferences have exactly this form (not all, for example, are inferences to causal structure), the distinction between sample and population structure is essential, and in particular statistical inferences always move from a claim about sample statistics to the inferred proposition, whatever this may be or be about.
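The double inference just described can be illustrated with a minimal sketch. It is not the procedure of Spirtes et al. (1993); it assumes, purely for illustration, a hypothetical linear Gaussian population in which X causes Y and Y causes Z, and the names `simulate_chain`, `pearson` and `partial_corr` are my own. The sample statistics computed are ordinary correlations and a first-order partial correlation; a partial correlation near zero is taken as sample evidence for the population conditional independence of X and Z given Y, which is what a chain structure entails.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, zs, ys):
    """Sample correlation of x and z after linearly controlling for y."""
    rxz = pearson(xs, zs)
    rxy = pearson(xs, ys)
    rzy = pearson(zs, ys)
    return (rxz - rxy * rzy) / math.sqrt((1 - rxy ** 2) * (1 - rzy ** 2))


def simulate_chain(n, seed=0):
    """Draw n samples from an assumed population with structure X -> Y -> Z."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    zs = [y + rng.gauss(0, 1) for y in ys]
    return xs, ys, zs


# Step 1: compute sample statistics from the data.
xs, ys, zs = simulate_chain(5000)
r_xz = pearson(xs, zs)
r_xz_given_y = partial_corr(xs, zs, ys)

# Step 2: read the statistics as evidence about population structure.
# X and Z are clearly correlated, but the correlation (nearly) vanishes
# once Y is held fixed, as the chain X -> Y -> Z entails.
print("corr(X, Z)     =", round(r_xz, 3))
print("corr(X, Z | Y) =", round(r_xz_given_y, 3))
```

The step from the near-zero partial correlation to a causal model is a further inference, and other structures (for example, Y as a common cause of X and Z) entail the same conditional independence. The sketch shows only the movement from sample statistics to a claim about the population.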
So while I do not know exactly what is meant by ‘pattern’ in Bogen and Woodward’s work, I take it their various accounts are meant to be general, and hence that by ‘pattern’ Bogen and Woodward and McAllister ought at least to include the values of sample statistics.

The Bogen and Woodward version of the distinction between data and phenomena relies heavily on supposed differences in the epistemic status of data and phenomena. According to the distinction they advance, claims about data are not subject to certain kinds of epistemological challenges that claims about phenomena are, but claims about phenomena are explicable in ways that claims about data are not. This supposed difference is illusory: certain entities both have the epistemically foundational status of data and are susceptible of explanation by theory in just the way phenomena are. If we are certain, at least in the relevant sense, of the observations comprising a data set, then the mean value of a variable in the data, or its variance, the shape of the distribution, correlations between variable values, and so on, are no less certain. So sample statistics have the same epistemic status as the observation reports comprising the data in the sample. But it is precisely this sort of statistical feature of data sets that is explained by scientific theories. While a correlation between variables A and B is no guarantee that there is a causal relation between the two, such a causal relation does explain an observed correlation between the two variables. Moreover, it is just such statistical features of distributions of data that are repeatable, and indeed it is these that one expects to recover if one repeats an experiment whose results one seeks to verify. Hence, data and phenomena are, minimally, not exclusive of one another. At the very least, then, the distinction Bogen and Woodward draw is not as sharp as it ought to be.
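The claim that sample statistics are no less certain than the observations can be made concrete with a small sketch. The data values below are invented for illustration, and the function names are my own. The point is only that the mean, variance and correlation are fixed arithmetic functions of the recorded values: no background theory enters the computation, and recomputing from the same observations must give the same result.

```python
import math


def sample_mean(values):
    """Arithmetic mean of the recorded values."""
    return sum(values) / len(values)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_correlation(a, b):
    """Sample Pearson correlation between paired observations."""
    ma, mb = sample_mean(a), sample_mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical paired observations of two variables A and B.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 5.0, 4.0, 5.0]

# The statistics are determined entirely by the observations ...
stats_first = (sample_mean(a), sample_variance(a), sample_correlation(a, b))
# ... so computing them again from the same data changes nothing.
stats_again = (sample_mean(a), sample_variance(a), sample_correlation(a, b))

print(stats_first)
print(stats_first == stats_again)
```

Whether the correlation so computed is explained by a causal relation between A and B is a separate, inferential question; the sketch concerns only the status of the statistic itself.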
Nothing I have said so far challenges either McAllister’s critique of Bogen and Woodward, or his preferred account of the distinction between data and phenomena. But unlike McAllister and Bogen and Woodward, I am not convinced that any such distinction is necessary, nor that the empirical warrant for inferences to theories is necessarily theory laden in the way that McAllister’s account suggests it must be.

Suppose the scientific inferences of interest are statistical inferences with the structure suggested above. We can either take the distinction between data and phenomena to correspond exactly to the distinction between sample and population structure, or not. Suppose we take the distinction between data and phenomena to involve something over and above the distinction between sample and population structure. Then statistical inference procedures, and methodological justifications for them, will not require the distinction between data and phenomena, and hence the distinction will be unnecessary. Suppose instead we deny that the distinction between data and phenomena involves something over and above the distinction between sample and population structure. That is, we take the statistical structure of data samples (sample statistics) to correspond to data, and population structure (population parameters, density functions and conditional independencies) to correspond to phenomena. Then the distinction between data and phenomena simply gives a new name to a distinction which is already deeply embedded in the literature on statistical inference. The terminological reform is unnecessary and in some respects misleading, and hence should be avoided. Moreover, since on some statistical inference procedures, e.g. Bayesian scoring procedures, one infers directly from sample structure to theory, the distinction between data and phenomena will not play any essential role in these sorts of inferences or the justification of these inferential methods.
This leaves only the question of whether statistical methods are necessarily subjectivist in the sense suggested by McAllister. Some methods clearly are, e.g. Bayesian methods when employed with a subjectivist account of probability. Others, however, are not. If we assume, with Bogen and Woodward and McAllister, that the observation reports comprising the data are not themselves theory laden in the relevant sense, then inferences from the data to causal structure can be theory dependent in two ways.

First, the inferences may rely on causal assumptions about which variables are causally connected, about the functional form of the connection (are the equations linear or quadratic?), and about the level of noise in the data (is the noise generated by probabilistic or indeterministic dependencies, or by some unmeasured deterministic cause?). So, for example, one might well assume that the value of variable 2 at t2 cannot exert a causal influence on the value of variable 1 at t1 if t1 is prior to t2. Not all such assumptions are so innocuous, and even innocuous assumptions may be mistaken. Hence inferences employing such assumptions are theory dependent. But this sort of theory dependence is essentially dissimilar from that required by McAllister’s version of the distinction between data and phenomena. There is, on his account, no objective reason to prefer recognizing one pattern among many as a phenomenon, and for this reason there is no possibility of offering a cogent, objective criticism of any given choice about which patterns are to be taken to correspond to phenomena. On his account such choices are arbitrary because they simply can be nothing other than arbitrary. Clearly, however, there can be cogent reasons, founded on objective empirical or conceptual resources, for objecting to the set of causal assumptions that underwrite any particular inference from data to causal structure (cf. Spirtes et al. 1993).
Hence the assumptions about physical theory that underwrite an inference from data to causal structure need not be, and often are not, arbitrary. Assumptions about physical theory on which statistical inferences depend are subject to cogent critique, whereas the assumptions that underwrite an inference from data to phenomena, on McAllister’s account, are not.

Second, the causal structures to which one infers on the basis of a given set of data depend on the statistical methods one adopts. Not all methods are appropriate in all contexts, and there is an ongoing controversy about which methods are best used in which contexts (see, for example, Hellman 1997a and 1997b; Kelly et al. 1997; Korb and Wallace 1997; Spirtes et al. 1993). A claim about the appropriateness of a given method in a given context is itself a theoretical claim, and so the choice of statistical method is itself a theoretical assumption of a sort. Hence there is this second way in which inferences to causal structure are theory dependent. But again, this theory dependence is quite different from the theory ladenness under which claims about phenomena suffer on McAllister’s account. First, the theories in question here are in general not physical theories, but rather mathematical theories. On some Quinean or Millian accounts, the difference in epistemological warrant for empirical and mathematical theory is illusory. On others, however, it is not. But second, even if one does not think there is some essential difference in the epistemological warrant we can have for mathematical as opposed to physical theories, the statistical methods we adopt are subject to criticism in a way that inferences to phenomena are not. There are well defined theories of reliability, on which the reliability of the various methods is assessable (cf. Kelly 1996).
If one adopts, in a given context, a method which is unreliable in a given sense, then one cannot also endorse that notion of reliability. Those who both adopt the method and endorse the notion of reliability are committed to an inconsistency, and hence subject to cogent, objectively grounded criticism, as inferences to phenomena, on McAllister’s account, are not.

The naive ontological distinctions between (1) observations of events, (2) the causes, conditions, and properties that produce the observed events, and (3) the natural kinds to which such events belong, are certainly cogent. More, the distinctions are essential if one is to clearly delimit the epistemological difficulties scientists confront. So too the conceptual distinctions between the variable values comprising a data set, the sample statistics characteristic of the data set, and population parameters characteristic of a population of data are essential for describing and justifying various methods for scientific inference. What is not necessary is a sophisticated distinction between data and phenomena on which these concepts play some essential role in describing the structure of scientific inferences, or in justifying inferences with that structure.

The work of Bogen and Woodward is interesting and for various historical reasons important (it was, for example, among the first work by serious analytic philosophers that was attentive to the intricacies of experimental science). But I think that McAllister is right in claiming that the central distinction between data and phenomena offered there is inadequate, though he and I have slightly different reasons for thinking this. Unlike McAllister, however, I do not see why the distinction need play any important, much less indispensable, role in our philosophic account of scientific practice. No such distinction is necessary because the inference from observation to causal theory or classification is invariably statistical.
The description and justification of statistical inferences requires the above mentioned statistical concepts, but not those of data and phenomena. Further, if data points are not theory laden, then neither are sample statistics, and the latter are both theoretically explicable and replicable. Given a set of the former, the latter are constructible, and given the latter, non-theory laden inferences to both population models and causal models are possible. As a consequence, I do not see why the empirical basis for inferences to theories (phenomena for McAllister, sample statistics and/or population models for me) need be essentially subjectivist or theory laden in the way McAllister’s account entails it must be.

NOTES

∗ Thanks are owed to James McAllister for many helpful comments on previous versions of this paper.

1 Indeed, any finite data set is guaranteed to exhibit an infinite number of patterns even with zero noise. McAllister is not unaware of this familiar point; he simply takes some data sets, e.g. the continuous pen trace left by a seismograph, to be of infinite size. McAllister is exactly right that any such data set exhibits exactly one pattern with zero noise, but I think this is not a relevant data set. One does not often infer from any single such trace, but rather from a finite set of such traces. Unless the traces agree exactly about the values of each variable at each time, no deterministic relationship between measured variables is exhibited with zero noise, since for each pattern there will be at least one data point on at least one trace that is inconsistent with the pattern. The difficulty can be resolved in either of two ways. One can describe a deterministic pattern between measured variables and at least one unmeasured variable, provided the value of this variable differed during the experimental runs recorded by inconsistent traces in the data set.
In this case, the data set exhibits an infinity of patterns with zero noise, since the data set includes values of the measured variables for only a finite set of values for the unmeasured variable. Differently, one can solve the difficulty by taking the relationship between measured variables to be essentially probabilistic. In this case no set of observations is logically inconsistent with any pattern whatsoever, and so again an infinity of patterns are exhibited with zero noise.

REFERENCES

Bogen, J. and J. Woodward: 1988, ‘Saving the Phenomena’, Philosophical Review 97, 303–352.
Hellman, G.: 1997a, ‘Bayes and Beyond’, Philosophy of Science 64, 191–221.
Hellman, G.: 1997b, ‘Responses to Maher, and to Kelly, Schulte and Juhl’, Philosophy of Science 64, 317–322.
Kelly, K.: 1996, The Logic of Reliable Inquiry, Oxford University Press, Oxford.
Kelly, K., O. Schulte and C. Juhl: 1997, ‘Learning Theory and the Philosophy of Science’, Philosophy of Science 64, 306–316.
Korb, K. and C. Wallace: 1997, ‘In Search of the Philosopher’s Stone: Remarks on Humphreys and Freedman’s Critique of Causal Discovery’, British Journal for the Philosophy of Science 48, 543–553.
McAllister, J.: 1997, ‘Phenomena and Patterns in Data Sets’, Erkenntnis 47, 217–228.
Spirtes, P., C. Glymour and R. Scheines: 1993, Causation, Prediction and Search, Springer-Verlag, New York.

Department of Philosophy
Kansas State University
Manhattan, KS 66506
U.S.A.