
Does Ethics Require Evidence-Based Treatment in the Mental Health Disciplines?
(Draft)
1. Introduction
The idea of evidence-based treatment in psychology and the other mental health fields is not new, and the enterprise is far
enough along in both medicine and mental health to have developed orthodoxies which have, in turn, defined heresies (e.g.
Davidoff 1999; Szatmari 2003). Some orthodox priests have taken positions to the effect that non-evidence-based treatments
should be banned (e.g., Lilienfeld 2007), though so far there have been no calls for public burnings, either of books or
practitioners. The orthodox position reflects the status evidence-based treatment has achieved in physical medicine. And
although, perhaps embarrassingly, there is no evidence that evidence-based treatment improves medical care (Haynes 2002;
Straus & McAlister 2000), I believe a majority of physicians and many laypersons would nevertheless agree that medicine has
benefitted from its introduction. In 2007 evidence-based medicine was included among the best 15 ideas in medicine since
the mid-nineteenth century (Morrison 2007), and it is a concept now often found in the popular press, which implies it has
achieved the status of a truth no one would question. The April 2015 edition of The Atlantic, for instance, carried an article
criticizing Alcoholics Anonymous because it is not an evidence-based treatment.
Evidence-based medicine began in the early 1990s with the somewhat informal proposal that clinical decisions in medical
practice should be consciously built on “current best evidence,” as opposed to reliance on habit, local preferences, or authority
(Evidence-Based Medicine Working Group, 1992). Those who have tried since then to elaborate and develop the idea of
evidence-based practice have proposed several hierarchies of what constitutes “best evidence.” Though these hierarchies
have been altered a little over the years, and seem to be something of a work in progress, they all assume that evidence is the
result of Newtonian “empirical observation,” and the only important questions are those that have to do with what sorts of
empirical observation are most valid and reliable. The broad position currently is: “any empirical observation constitutes
potential evidence, whether systematically collected or not” (Guyatt et al 2008 p.10), and even the “unsystematic observations
of the individual clinician constitute one source of evidence . . .”
Unsystematic observations, however, are not considered to be the best evidence, and evidence-based medicine proposes
strength-of-evidence-hierarchies for different areas of medical practice. For instance, the hierarchy for prevention and
treatment decisions in Guyatt et al (2008 p.11) looks like this:
N of 1 randomized trials;
Systematic reviews of randomized trials;
Single randomized trials;
Systematic review of observational studies addressing patient-important outcomes;
Single observational study addressing patient-important outcomes;
Physiologic studies (of blood pressure, bone density, etc.);
Unsystematic clinical observations.
At the bottom of the hierarchy is the practitioner’s experience, and though it is still evidential, methods that better control the
selection procedure for bias are preferred. Whatever area of medicine is being practiced, the idea is that clinical decisions
should spring from the most unbiased and empirically based knowledge one can get hold of, with randomized trials given pride
of place.
In fact an emphasis on randomized trials has come to be synonymous with evidence-based treatment in much of the literature
as if the rest of the evidence hierarchy did not exist. (For instance, Straus et al. 2005 suggests, “If a study wasn’t randomised,
we suggest that you stop reading it . . .” p.118.) The emphasis on randomization in evidence-based medicine has led to at
least one instance in which the behavior of medical researchers in insisting on a proper randomized trial was so mindless and
mechanical as to appear negligent if not criminal. In the 1980s, years of clinical observations had suggested the utility of a
new treatment for persistent pulmonary hypertension (PPHS), a condition whose 80% mortality rate in newborns was reduced
to 20% by the new treatment approach now known by the acronym ECMO. Insisting on a proper randomized trial before
accepting the new treatment, researchers exposed ten infants suffering PPHS in the control group to the likelihood of death, and
in fact four of the ten died before the trial was stopped (Bartlett, et al. 1985). Certainly in this case randomization seems to
have been more a fetish than a guide to truth (see Worrall’s 2002 discussion).
There is a less discussed second principle behind evidence-based medicine in addition to the hierarchy of evidence principle.
Evidence by itself is necessary but not sufficient for making clinical decisions. In addition, “Decision makers must always
trade off the benefits and risks, inconvenience, and costs . . . and their patients’ values and preferences” (Guyatt et al 2008 p.10,
authors’ emphasis). This second principle amounts to saying whatever evidence can be found will have to be understood and
used in the context of the patient’s life. It is an intriguing question why this aspect of evidence-based medicine is largely
ignored (e.g. in nearly all the rest of Guyatt et al.) and may even be unknown among many who throw the term about. It
certainly seems to have been unknown to those who carried out the PPHS trial just discussed.
It is just possible that power enters in. Even though the actual process of evidence-based medicine is in a virtuous flux and
has been revised several times, and even though there is no evidence that evidence-based medicine is superior to whatever
came before it, the word evidence is an authoritative and even a provocative term, and it is open to appropriation by any group
with a competitive ax to grind. Those who use the term have grabbed quite a lot of power to say what is good and what is bad
in their area of professional practice (Gupta 2014).
After all, who can be against something that is based on evidence?
Authoritative or not, is an evidence-based approach fully applicable to those fields which study whatever it is we mean by the
“mental,” including the use of psychoactive drugs and the psychotherapies? We are surely material beings, and so whatever
we mean by mental will be material as well and, presumably, amenable to empirical study. On the other hand, we almost
certainly do not understand matter as well as is commonly supposed and, therefore, may not know all the ways to study it. It
may be a good idea to ask whether measurement can play the same role in the study of conscious experience (which Galen
Strawson, 2010, has characterized as “the hard part of the mental”) as it can and must in the exact sciences. Possibly both of
these questions boil down to whether it is desirable for psychology and psychiatry to ape the hard sciences.
2. What is evidence in science?
If we are going to discuss evidence based treatment, we should begin by understanding the concept of evidence as it has come
to be used in science. When we say that e is evidence for hypothesis h, we are saying that e gives “a good reason to believe”
that h is true and, first of all, that h is more probable in light of e than without it, or: p(h/e) > p(h) (Hesse 1974, p.134; Swinburne
1973, p.3). That, however, cannot be a complete definition. An increase in probability may (usually) be a component of
evidence, but it is hardly sufficient–buying a lottery ticket is evidence I will win the lottery, but it is not, alas, very much
evidence. Consequently, there is a modified version of the definition just given which is today the most widely used
probabilistic definition. It is that something counts as evidence only if it increases an event’s likelihood beyond some high
threshold, or: p(h/e) > k, where k is a threshold of high probability, at the very least better than even money (Achinstein
2001, p.24).
This is not a complete definition either, however. There are, in fact, scenarios in which e does not increase the probability of h
and is nonetheless evidence because p(h/e) exceeds k, and other scenarios in which e is evidence that h even though e actually reduces p(h)
(for several brain-teasing examples see Achinstein 1983, pp.329-330). Therefore another factor is needed beyond probability
in order to arrive at a good reason to believe. Peter Achinstein (1983, 2001) suggests there must be an explanatory
connection between e and h. If we add explanation to the equation, the definition of evidence becomes:
e is a good reason to believe h if
1) p(h/e) > p(h) and
2) p(h/e) > k and also if
3) there is probably an explanatory connection between h and e.
The key point is that an increase in probability alone is not sufficient for evidence, not even if the increase exceeds a high
threshold. For e to be a good reason to believe h, h must plausibly explain e, that is, explain why e is true (see Note 1).
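To make these three conditions concrete, here is a minimal sketch in code; the function and the lottery figures are my own illustration and are not drawn from Achinstein.

# A minimal sketch of the three conditions above; the function name and the
# lottery figures are invented for illustration and are not from Achinstein.

def good_reason_to_believe(p_h, p_h_given_e, h_explains_e, k=0.5):
    """e is a good reason to believe h only if all three conditions hold.

    p_h           -- probability of h without e
    p_h_given_e   -- probability of h in light of e, i.e., p(h/e)
    h_explains_e  -- a judgment (not a number) that h plausibly explains e
    k             -- threshold of high probability, at least better than even money
    """
    condition_1 = p_h_given_e > p_h      # e raises the probability of h
    condition_2 = p_h_given_e > k        # and raises it past the high threshold
    condition_3 = h_explains_e           # and there is an explanatory connection
    return condition_1 and condition_2 and condition_3

# The lottery case: buying a ticket raises my chance of winning from zero to
# roughly one in three hundred million, so condition 1 holds but condition 2 fails.
print(good_reason_to_believe(p_h=0.0, p_h_given_e=1 / 300_000_000, h_explains_e=True))  # False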
The reason h must explain e is that there is a difference in kind between data (evidence) on the one hand and the phenomenon
(hypothesis) for which the data are being used as evidence on the other. Empirical data, which can be observed and/or
measured, provide support for the existence of phenomena which, by contrast, are not directly observable, for instance,
Spearman’s g (or general intelligence) or Skinner’s notion of operant conditioning (and both are different from overarching
theories, which need not concern us here). The difference between observational data and phenomena is that the former are
variable and unstable, requiring many attempts (replications) at defining what exactly they mean while the latter are persistent
over time, recurrent, and general (Bogen, 2010, 2011; Woodward, 2010, 2011; see also a review of Newton’s use of
“phenomenon” in Achinstein 1991, pp.33-35). Given the variable nature of data, experimentation balances and idealizes
data only with an extended number of replications (cf. Open Peer Commentary 2013). In any single experiment confounding
variables, if present, will be unevenly distributed, and there will be no way to know if or how this has occurred in a particular
trial. It is the overall body of data that eventually yields a good reason to believe a phenomenon. Or, “Publication of a new
scientific finding should be viewed more as a promissory note than a final accounting” (Cacioppo & Cacioppo 2013, p.121).
What if, as is usually the case, there is some degree of reasonableness in believing several hypotheses (viz. in several
phenomena) in light of the data? Suppose a fair roulette wheel lands on a black number one third of the time in a given
sequence of spins and not one half the time. In that case both of these statements are true: the probability the wheel will land
on a black number is one half if previous events are discarded AND the probability of a black number is less than one half if
previous events are included. In such a situation, something is going to have to be discarded and something else kept, and
what is kept will depend on one’s interests, other available information, and perhaps funding or prospects for advancement.
In such a case, though, what is included and what is disregarded are determinants of p or: Whether or not it is reasonable to
believe e is evidence of h depends on what biases or relativizations or non-specific control conditions are introduced and which
are ruled out (Achinstein 1983, p.108ff). In that case, the formula for p should read something more like this: p(h/e in light of
claims whose truth or irrelevancy is being assumed) = r where r is a measure of probability. Such a statement does not imply
that the probability statement is meaningless or invalid but that it is likely to be incomplete as stated.
The claim is not that there are no previous events or, for that matter, no overall context of factors that are assumed to be true or
irrelevant, only that some aspects of a situation are being ignored in the interest of determining a probability. Yet some of
what we ignore may be crucial. For instance, and unhappily, it appears there may be a flawed selection procedure spanning
years in how the results of antidepressant trials are reported. Turner et al. (2008) examined 74 FDA-registered studies
involving 12 different antidepressant drugs and over 12,500 participants. When they looked at those studies that were
published, it appeared that 94% of this group of trials were positive, but the proportion of positive trials fell dramatically to 51%
when all of these studies were examined without regard for publication (Turner et al. 2008).
This implies a pretty severe bias (toward positive findings) in the selection of what evidence is made public. Flaws in the
selection procedure to this degree may mean that whatever is being presented as e is incomplete to the point of becoming
meaningless or that selection is influenced by factors other than finding good reasons to believe e. This is to say that quite a
lot rests on the selection procedure whereby data are brought forward as evidence. For example, the body of research
captured in Turner et al (2008) cannot be regarded as evidence for the efficacy of antidepressant drugs since p(h/e) = k (but
does not exceed it) where h states that antidepressant drugs are effective, e is the accumulated data of all 74 studies registered
with the FDA in the period Turner et al. surveyed, and k is even money. Or: flaws are a matter of degree, but the greater the flaw
in the selection procedure (SP), the closer p(h/e + SP) comes to p(h), and the less e is a good reason to believe h.
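The role of the selection procedure can be made concrete with a back-of-envelope sketch that uses only the two percentages just quoted from Turner et al. (2008) and treats, as I just have, the proportion of positive trials as a rough stand-in for p(h/e); the variable names and the rounding are mine.

# Back-of-envelope arithmetic using only the percentages quoted above from
# Turner et al. (2008); variable names and rounding are illustrative.

k = 0.5                              # "even money"
registered_trials = 74               # all antidepressant trials registered with the FDA
positive_rate_published = 0.94       # share of published trials that appeared positive
positive_rate_all = 0.51             # share of all registered trials that were positive

positive_trials = round(positive_rate_all * registered_trials)
print(positive_trials)               # about 38 of the 74 registered trials were positive

# Conditioned on publication (a flawed selection procedure), the record looks like
# strong evidence; across all registered trials it barely clears even money.
print(positive_rate_published > k)   # True
print(positive_rate_all > k)         # True, but only just: p(h/e) is essentially k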
So where are we? Evidence e is a good reason to believe in the existence of a phenomenon or hypothesis h if there is an
explanatory connection between h and e and if e increases the probability of h past a certain high threshold and when e is a
collection (or collections) of variable but observable data and h is a comparatively persistent phenomenon. However,
whether the data constitute a good reason to believe depends largely on selection procedures, what is left in versus what is left
out when the data are considered.
3. Statistical power, selection procedures, and randomization
Null hypothesis significance testing (NHST) has been the “mainstay” for research into what does and does not work in
psychotherapy and psychiatry (Haig 2014, p.72, also Laudan 1981). NHST champions randomization, which, as we have
seen, is at the top of the evidence hierarchy. When we note that randomization is evidence-based treatment’s chief (some
would say, only) contribution, we see the importance of NHST and especially its methods for managing SP.
In spite of its
widespread use, NHST faces formidable problems–some practical, some theoretical–beginning with statistical power.
The statistical power of a study is its ability to avoid Type II errors (in everyday language, thinking there is no good reason to
believe something when in fact there is). In general the bigger the number of subjects in a study, the higher its power. If the
power of a study is sufficiently high, the study will be able to pick up evidence for a hypothesis even if its effect on the overall
data is small. A statistical power of .8 is generally regarded as high enough most of the time, which means that out of ten
so-called true hypotheses the study will find good reasons to believe eight of them.
Let’s see what that might mean by drawing on a particularly accessible example first published in the October 19, 2013 issue of
The Economist. Imagine that a study has set its alpha at .05, and so only one result out of 20 will be falsely identified as true
(Type I error) when it is not. Let us also say the power of the study is a hefty .8, and the study tests 1000 hypotheses of which
there is, potentially at least, good reason to believe only 100. A power of .8 means that 80 of the true hypotheses will be
found, and 20 won’t. In addition, of the 900 false hypotheses five percent or 45 will be incorrectly identified as true. Thus
the study will produce 125 results which it identifies as true, or supported by evidence to such an extent it is reasonable to
believe them. In this case, however, over one third are in fact false, which means that even in some studies with high power,
only about two thirds of the results identified as true may actually be so. While this might be surprising, the overall body of
data nonetheless meets a definition of evidence, that p(h/e) > k, assuming, of course, an explanatory connection between h and
e and assuming that k is no greater than two out of three.
What if the power is less than .8?
Researchers have long complained that statistical power in psychology is usually between
.3 and .5 (Button et al. 2013; Cohen 1990) or that “the typical power in our field will average around .35” (Bakker et al. 2012,
p.544). If we apply the last number to the example just discussed, it means we will only pick up 35 of the true hypotheses, or
fewer than those resulting from Type I error, which is still 45. The study’s results will say that 80 hypotheses were found to be
supported by strong evidence, but less than half of what the study claims to have found will be valid. It would seem that in
low power studies the positive results are much less reliable than the negative results and even positive results from studies
with high power might not be as persuasive as they seem at first. Unhappily, negative findings are rarely made public or find
their way into publication (the results of the Turner et al. 2008 study mentioned earlier provide an illustrative example). In
any case, it is clear that for studies with low power the accumulated results will not meet any acceptable definition of evidence,
since p(h/e) < k.
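The arithmetic of the last two paragraphs can be condensed into a short calculation; the function below simply restates the figures already given (1000 hypotheses, 100 of them true, an alpha of .05) and adds nothing beyond them.

# Restatement of the worked example above (1000 hypotheses, 100 true, alpha = .05);
# the function name is mine, the figures are those already given in the text.

def findings(power, n_hypotheses=1000, n_true=100, alpha=0.05):
    true_positives = round(power * n_true)                   # true hypotheses actually detected
    false_positives = round(alpha * (n_hypotheses - n_true)) # Type I errors among the false ones
    claimed = true_positives + false_positives               # what the study reports as "found"
    return true_positives, false_positives, true_positives / claimed

print(findings(power=0.80))   # (80, 45, 0.64): about two thirds of the "findings" are real
print(findings(power=0.35))   # (35, 45, 0.4375): less than half are real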
Since there are statistical methods (for example, Bayesian) which have been put forward as solutions to this problem, we might
assume it must not be too much of an issue in current research. However, Bakker and Wicherts (2011) found that just slightly
better than one in ten papers in psychology with null hypothesis significance testing even discuss statistical power as a factor in
their choice of sample size or research design. One might argue that most research in psychology is underpowered to the
point of making reported findings highly questionable. If the power is even .5 the chance of finding a valid positive result is
no better than a coin toss and even less than that if Type I errors are added to the equation.
While statistical power is not the only issue plaguing NHST, it is one reason there have long been calls for abandoning that
research strategy altogether (see discussions in Haig 2014; Worrall 2002). As noted earlier, the chief argument in favor of
significance testing is that it requires randomization and therefore controls for bias. Statisticians have long maintained both
that the validity of significance testing is guaranteed solely by randomization (Byar 1976; Fisher 1947, p.19) and that randomization can
cure all ills. Some of the claims–e.g., that randomization protects against “ALL other factors, even those no one suspects”
(Giere 1979, p.296)–are of course silly, but a more reasonable argument is that randomization tends to control or will likely
control for confounding variables (again, see the discussion in Worrall 2002). This is the reason evidence-based practice
strongly endorses randomization, why RCTs are at the top of the evidence hierarchy, and why significance testing is still the
dominant form of research in psychology and psychiatry. But does randomization (or its handmaid, blinding) actually
manage bias, which is the only argument for preferring it above other methods of research?
Let us conceptualize this problem as one of internal validity or whether what is taken to be evidence is indeed causally related to
the hypothesis under study–viz. whether there are plausible alternative explanations for what is being proposed as evidence
(Cook & Campbell 1979, p.50). This is the traditional problem that confronts the explanatory hypothesis definition of
evidence, that there are often multiple hypotheses which explain the data (Newton 1946; Achinstein 1991, pp.31-67). And
this is the issue we saw earlier when we noted there is usually some degree of reasonableness in believing more than one
hypothesis, that something must be discarded from the argument (experiment) while something else will be kept, and if the
selection procedure by which this is determined is flawed, the results are likely to be meaningless. But if the selection process
can be managed so that excessive bias is not a factor, then randomization/blinding and significance testing may deserve their
status.
There are reasons to be doubtful. Randomization is difficult to accomplish in the sorts of field studies required in psychiatry
and psychology (for example, Cook & Campbell 1979; Shadish et al. 2002). Though randomization can be used in such studies,
there are very many things that can go wrong with the control group which might well lie behind any apparent treatment
differences from the experimental group. These include: imitation of the experimental group, demoralization or its opposite
(the “John Henry effect”), growing reluctance to tolerate the inequalities of benefits available to the experimental group, and
differential mortality. It is also hard to imagine that subjects in psychotherapy research are actually blind if informed consent
has been done properly (the therapists, of course, will not be blind, and neither will be the researchers if they pay attention).
Similar difficulties and other confounding variables may lie behind several well publicized failures in recent years to duplicate
medical findings (see Begley & Ellis 2012; Prinz et al. 2011).
Yet there are true believers. For example, Mohr et al. (2009) state–somewhat blithely, it seems to me–that careful design of the
control condition can remove unwanted variables in randomized clinical trials and allow researchers to rule out any alternative
explanation in the event of a significant finding. However, some of what they propose amounts to making interventions or
treatments so mechanical and routinized (“manualized”) that what happens in the study may not resemble treatment as it
occurs in the real world any more than a chess game that is plotted beyond the first ten or so moves would resemble
tournament chess. Manualization may protect internal validity, but it obliterates its external cousin. Indeed, there is
already reason to believe that researchers have built straw men when designing the control group for comparison with their
preferred form of treatment (see the findings of Elliott et al. 2013 and also Spielmans, Pasek, & McFall 2007).
In addition, although Mohr et al. endorse using adequate statistical power so that non-specific control conditions
(relativizations) are washed out, this would require sample sizes which have historically been problematic for many researchers.
Further, most therapy studies are never replicated (Bakker et al. 2013), although, as discussed earlier, data are only useful if
enough trials are carried out to neutralize any confounding variables that might be present. And if randomization were as
possible as Mohr et al. argue, I doubt we would have the frankly bizarre situation famously suggested by the Brody et al. (2011)
commentary on subject recruitment which raises the specter of “professional subjects” in psychotherapy and psychiatry
research. Indeed, half a century ago Tullock (1959) was questioning the possibility of drawing random samples from
specified populations and was therefore skeptical of significance testing. My point is that what are called field studies are
likely to have fatal problems with internal validity, selection procedures, and whether a p statement actually describes a
meaningful relationship between e and h in spite of the supposed benefits of randomization/blinding.
What do these questions do to the notion of evidence? We recall that it all comes down to (or begins with) selection
procedures when it comes to a good reason to believe something. In the discussion just finished I have made references to
problems with: low statistical power and even whether power is taken into account when planning research; difficulty
duplicating findings and even attempting to replicate previous studies; publication bias and reluctance to publish negative
findings; so-called professional subjects in field research; distorting rival forms of treatment used in control groups; and whether
randomization and its benefits are truly available in research outside a highly controlled setting. At the moment there is
reason to doubt that research in the mental health disciplines very often offers us something that could be called evidence,
particularly when it depends on NHST.
What about research into the somatic interventions, particularly psychoactive drugs? Psychiatry appears to me to have had the
good (marketing) sense to have claimed the authoritative and perhaps talisman-like phrase “evidence-based” before its
competitors (though I believe CBT practitioners have caught on). Though it may well have copyrighted the term by now, is
psychiatry doing a better job than psychology at producing something that could properly be called evidence? Marcia Angell,
a former editor of The New England Journal of Medicine, has offered a discouraging answer. She reviewed a number of biases
in medical research in general and psychiatry in particular and concluded (2009), “The problems I’ve discussed are not limited to
psychiatry, although they reach their most florid form there . . . It is simply no longer possible to believe much of the clinical
research that is published . . .”
Two years later Angell produced a more pointed assessment of psychiatry. She began by noting that since psychiatry has no
known identifiable physiological basis for the disorders it treats, it cannot reliably differentiate health from pathology.
Consequently, “industry sponsors” can create new diagnostic categories to develop new markets (2011), a sort of pathology
creep to which anyone who has watched commercial television can testify. The basic argument is that psychiatry has been
gullible and so-called evidence-based treatment has been complicit in allowing bias into what is presented as evidence.
Why would she say such a thing?
The two types of bias or flawed selection procedures (SP) most discussed are those related to funding and publication (see
Gupta 2014 pp.47-53 for much of what follows). If I simply note that the largest funding source for clinical trials in psychiatry
is the pharmaceutical industry (Perlis et al. 2005), I may not have to say much more, since it is obvious that publicly traded
pharmaceutical companies are obliged to produce profits for shareholders and, so, will study drugs and procedures with
maximum commercial value and use research methods that lead to the most unequivocal outcomes. For instance, most trials
focus on a drug versus placebo, while clinicians and patients might have preferred to see the new drug tested against one
already on the market. In addition, to keep the results clear, trials will use whenever possible subjects suffering only the
disorder being addressed by the new drug, though such an uncomplicated case is rare in actual clinical practice. Finally,
follow-up tends to be brief, usually only weeks, which is problematic for clinicians whose patients have sometimes (or often)
improved briefly with a medication but then lost their gains, or who have developed health-related problems years
later instead of weeks (e.g. Type II diabetes in many taking second-generation antipsychotics and cases of suicidality associated
with SSRI therapy).
The question then becomes: Can any RCT be regarded as evidence with these limitations? As I have already noted, flaws in SP
are a matter of degree, but these flaws seem significant to me, since clinicians rarely see a clean clinical presentation and need to
know if the drug’s effects last and if they are superior to what else is available. Thus, the SP which is (understandably) used by
the drug industry may lead to something that might possibly be called evidence but to such a restricted degree that it is clinically
irrelevant or, better, evidence for a different phenomenon (h) than the one that was supposedly being addressed. For
instance, some big pharma data might be e for this h: “This particular drug is better than placebo and harmless in uncomplicated
cases for brief times”–though one suspects the study had wanted to say a bit more than this.
The second type of SP-related flaw, publication bias, refers to the apparent preference given in medical journals to positive or
statistically significant findings, thereby artificially reducing what is presented as evidence. Because of this bias, researchers
who must think about their careers or their funding source may neglect areas for study they feel are likely to produce negative
findings or dump studies that are yielding negative results, particularly if those results are being paid for commercially.
(According to Lexchin et al. 2003, data generated via commercial funding are more likely to show the effectiveness of the
sponsored intervention than data generated by non-commercially funded researchers for the same intervention.) Thus, what
is submitted for publication may already be influenced by factors that exclude large chunks of relevant data from ever entering
the discussion. Moreover, it would hardly be surprising if the data from commercially sponsored trials showing a drug is
ineffective or harmful are never even submitted for publication.
These sorts of biases may mislead clinicians to believe a drug or procedure is superior or safe because quite a bit has been left
out of what becomes public. Can such results be evidence? No. Even if the data come from RCTs and can claim to have
protected internal validity via randomization, what is missing is the explanatory connection between e and h, since, as with
cases of funding bias, the phenomenon (h) that is being studied is not the one for which the data claim to be e. For instance,
the body of data on antidepressant trials surveyed by Turner et al. (2008) could be e for this h: “Antidepressant medication
significantly reduces depressive symptoms in most studies that are published.” The problem, of course, is that this is not at all
what the published studies claim to be evidence of, and so the explanatory connection between h and e is broken. In this case
the formula becomes: p(h/e + SP) > k where SP is the background information “in the studies that are published.” But, as we
have seen, when SP is expanded (i.e., when all studies are considered), p(h/e) = k, and so, first, the body of data is not evidence
and, second, the selection procedure proves to be the crucial factor, not e.
The point is that there is less than meets the eye in most research into the psychotherapies and in psychiatry. The evidence
hierarchy is basically a hierarchy of internal validity (LaCaze 2009). Evidence-based medicine privileges the data from RCTs
over other forms of evidence because RCTs control or isolate the intervention being studied by hiding allocation to the control
or experimental groups, thereby ensuring internal validity. Those who endorse randomization in the context of evidence-based
treatment in psychology and psychiatry appear naively ignorant of selection procedure problems attending randomization in
these fields, problems which effectively destroy internal validity. For that matter, it is an unfounded and perhaps absurd faith that
RCTs offer better e than observational studies, which are much lower in the evidence hierarchy. Even though there is no good
reason to believe evidence-based medicine’s claim that observational studies exaggerate treatment effects while RCTs do not,
the prejudice in favor of RCTs is built into evidence-based medicine (see Benson & Hartz 2000; Concato et al. 2000; Worrall
2002). But if the prejudice is unfounded–as it seems to be–and if RCTs have not been able in actual practice to manage SP
and have thereby failed to safeguard internal validity, then the evidence hierarchy has no validity, and the way evidence-based
treatment has been understood and practiced is, at best, no different from what preceded it.
4. The visibility of things seen: Problems with external validity
To this point I have argued that randomization as the guarantor of internal validity is not what it is cracked up to be, and I have
argued that evidence-based medicine’s chief contribution–the evidence hierarchy–is invalid, at least in our field. My
emphasis has been on sources of bias in selection procedures found in RCTs in clinical psychology and psychiatry, and I have
argued these are serious enough to justify the doubt that most current data in mental health can be regarded as a good reason to
believe anything very significant if they are a good reason to believe anything at all.
There is another source of bias which Gupta (2014) has termed technical bias. Technical bias refers to the limitations in what
can be seen when only quantifiable data are allowed. Since RCTs require measurable data, they are not useful when
arguments cannot be reduced to mathematical form. Yet many significant arguments in the history of science have not been
statistically based or mathematically stated. For instance, Isaac Newton argued against the wave theory of light by noting that
light does not diffract into the shadows even though waves of other sorts bend as they pass an obstruction (Newton 1979,
pp.370-371). This argument is based on empirical observation, and it would be awkward to say it is not Newtonian science.
Newton and other particle theorists certainly thought it was e that not-h, where h was the wave theory of light (cf. Achinstein
1991). This historically important argument is not, however, quantifiable and would likely be classified as a “single
observational study” on the evidence hierarchy.
This brings us to the problem of external validity (viz. whether the results of a study can be applied to patients outside the study).
Too little can be seen when the only lens is mathematical, and, worse, factors other than those that are scientific will inevitably
start to drive research and limit even further what can be seen. Since whatever cannot be measured will automatically be
marginalized by the evidence hierarchy and will appear inferior to what can be expressed statistically, considerable
political and economic power will accrue to those dealing in measurable events.
Psychotherapy is a case in point. The talking therapies have tried to avoid being a casualty in such an environment and offer a
clear example of problems with external validity. In their efforts to get under the tent psychotherapists have found ways to
measure outcomes and have discovered, to their relief no doubt, that most every psychotherapy works, often better and longer
than drugs. As a consequence, however, we have the bizarre situation of the “Dodo verdict” (Stiles et al. 2015, p.283) or the
conclusion that obviously different therapies produce equivalent positive outcomes, suggesting that all mainstream
psychotherapies are almost equally effective (Lambert 2013). This is surprising, since it is plain that therapists from different
camps are not all simply doing the same general things, which has been the usual, if naive, explanation (see Stiles et al. 2015 for a
summary of relevant research).
The question immediately becomes: What exactly is being measured here? The absurd answer usually given to this question
is that all psychotherapies must be the same general phenomenon or, what amounts to the same thing, expressions of one or
more common factors. The argument goes like this:
1) If all psychotherapies are equivalent phenomena (if they are essentially the same cause),
2) then all will have equivalent measurable outcomes (then they will have essentially the same effects).
3) They do have equivalent measurable outcomes.
4) Therefore they are equivalent phenomena.
Though widely accepted, this argument is surely an instance of the logical fallacy of affirming the consequent (if a, then b; b,
therefore a), which is to say, an absurdity (which, perhaps not coincidentally, resembles the thought disorder known as
predicate thinking). Things get no better if we reverse 1) and 2) to make the argument read: 1) If all psychotherapies have
equivalent outcomes, 2) then they are equivalent phenomena. This does not work because there is no obvious explanatory
connection between 1) and 2), since there are many reasons besides identity or equivalency why two entirely different
phenomena (e.g., a loaded gun and an aortic aneurism) might have the same effect (e.g. the cost of a funeral).
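The invalidity of this form can even be checked mechanically, as in the following sketch (mine, purely illustrative), where a stands for the claim that the psychotherapies are equivalent phenomena and b for the claim that they have equivalent outcomes.

# A mechanical check (illustrative only) that "if a, then b; b; therefore a" is invalid:
# enumerate the truth-table and look for a case where the premises hold but the conclusion fails.

from itertools import product

counterexamples = [
    (a, b)
    for a, b in product([True, False], repeat=2)
    if ((not a) or b) and b and not a   # premises "a implies b" and "b" are true, conclusion "a" is false
]
print(counterexamples)   # [(False, True)]: the therapies could differ (a false) while outcomes agree (b true)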
One reason different phenomena–in this case different forms of psychotherapy–might have the same result is that they are
being studied at a superficial level that cannot detect actual and complex effects. Thus the by-now commonly assumed
equivalency of plainly different therapies is the mental health version of Hegel’s night in which all cows are black. The herd
likely looks different by day, and the psychotherapies might look different too without evidence-based medicine’s technical
bias. But we will not know as long as randomization is held in fetishistic reverence, since the way therapy is done in the real
world differs from what can be achieved experimentally, where treatments are likely to be much shorter and less complex
(Elliott et al. 2013; Spielmans et al. 2007; Gupta 2014) and where, as noted earlier, effective blinding is unlikely because of
informed consent. The thing simply cannot be seen clearly through a mathematical lens.
There is a broader problem with external validity, which affects both psychotherapy and somatically based interventions, which
is that all mental health fields work with diagnostic categories that do not appear to be naturally occurring entities with clear
definition and prognosis. Both a lack of definition and the absence of a predictable prognosis are problems for research into
various diagnoses. These problems were thought to be serious enough that the NIMH in 2013 stopped using DSM-IV-TR
categories in its funding decisions. The rationale was that DSM diagnoses are built around groups of symptoms agreed upon
by (some) clinicians, not on objective (laboratory) data with obvious and reliable parameters.
Regarding the absence of clear definition in diagnostic categories, for instance, Gupta (2014, p.100) makes the interesting
observation that there are 70 possible symptom combinations that would justify a DSM-IV-TR diagnosis of major depression and
93 possible combinations of all listed symptoms. An RCT studying some intervention with depressed patients might therefore
be comparing subjects who have little in common beyond their scores on a self-report inventory. It is questionable that all of
these people are experiencing the same phenomenon (viz., suffering the same disorder), and several rather different
phenomena may be confused as “depression,” which is not a homogeneous entity any more than every wine is French. This
situation will not surprise clinicians who have long noted that depression is a very different animal in those who hate
themselves (classic depression) versus those who hate the world around them (depression from a paranoid position) and
different as well in patients who are emotionally empty (schizoid depression), while all three groups differ from the deflation
seen in narcissistic individuals which is often misdiagnosed as depression. All of these might be lumped together in the same
study as “depressed” even though they will not all respond to the same interventions.
A related problem is that since there are no reliable physiological markers (or what in medicine might be called “mechanisms,”
LaCaze 2011) in mental illness, diagnosis and treatment and related research are open to manipulation (Angell 2011). By
now everyone must be aware of the explosive growth in bipolar diagnosis over the past 20 years, a phenomenon so ludicrous
that it appears to have induced DSM-V to redefine and tighten diagnostic criteria, presumably in an attempt to arrest
over-diagnosis at the same time television commercials are busily promoting it. My unsystematic clinical observations
suggest that very many of those who received this diagnosis and the cocktail of drugs typically used to treat it had only two
things in common really, that they were angry people and that they lacked age-appropriate self-management skills. When a
bipolar diagnosis and the drugs usually given to those receiving the diagnosis are given to children and adolescents with these
traits one suspects the purpose is political–viz. controlling a troublesome person–far more than it has anything to do with an
individual’s becoming able to love and to work (hence, the emphasis in DSM-V on the episodic nature of symptoms as opposed
to those children and teens who are almost always mad and poorly controlled).
The second issue is whether diagnostic categories used in the mental health fields have any prognostic weight, whether there is
an expected course the syndrome will follow if untreated. In physical medicine, most disorders have predictable trajectories,
which makes it possible to tell if an intervention has had an impact. This is not the case in our field. Browsing through
DSM-IV-TR or DSM-V, one notes that the shortest sections in the discussion of many syndromes are those titled “Prognosis” and
“Development and Course.” Moreover, what little is said in these sections has so many exceptions and variations that the
reader might conclude there is no typical sequence for the disorder. A quantitative research approach presupposes
prognostic homogeneity among groups being studied, but the absence of a known and reliable course to most mental illnesses
means there is no norm against which to measure the effect of an intervention (cf. Gupta 2014, pp.95-104). External validity
is elusive under such constraints. Even if a trial measures the effects of one intervention against another rather than the
impact of the intervention on the disorder itself, there is no good way to tell what exactly was the object of either intervention or
even if they had the same object if the natural course of the untreated disorder is unknown.
With such questions about external validity, I do not see how research results can be regarded as evidence. One issue
affected by external validity is the explanatory connection between h (the phenomenon being studied) and e (the data). If it is
not clear precisely what phenomenon is being studied, then it will not be clear what the data are supposedly evidence of. If
this is so, there can be no valid explanatory connection, which is part of what constitutes the nature of evidence. The
preceding discussion also opens the possibility that the data could be equally well explained by multiple hypotheses, and if this
is the case the probability of any one of the possible hypotheses in light of the evidence will not exceed the required high
threshold and will not, therefore, be evidence.
Until problems with external validity are solved, the idea of evidence-based treatment in mental health is folly. It is not clear
what exactly is being studied in research on psychotherapy, but the absurdity in concluding that different treatments all work
equally well is a strong hint that we have not yet found the right lens. Moreover, diagnostic categories lack coherence to such
an extent we cannot possibly know whether all subjects in a trial are experiencing the same thing, nor can we determine
whether changes to participants would have happened anyway, since prognosis is extremely fluid. I am not sure how the
notion of evidence can be defended in this environment.
5. Measurement
We are in this mess because of Galileo. Galileo did not get into trouble with Rome, as is usually thought, simply for saying that the earth
revolves around the sun; Copernicus, after all, had not even wanted to publish De Revolutionibus Orbium Coelestium and did so only at the
insistence of influential Churchmen. Galileo’s fault in the eyes of Rome was to redefine the meaning of truth. Before Copernicus,
Galileo, and Kepler, truth was not taken to be measurable; what can be measured is that which is continually changing, coming
into being, and then passing away, in contrast to truth, which was held to come only from proof. Scientific knowledge,
therefore, before Galileo, was knowledge of theorems of a system from which everything else is deduced (Philipse 1995, p.289).
It was, of course, known that some events–especially astronomical ones–could be measured and reduced to geometric
formulae, but these mathematized versions of phenomenal events were understood as mere conventions whereby phenomena
could be summarized or accounted for, not as truth. The phrase Simplicius came up with in the sixth century to describe this
understanding of measurement was that mathematical formulae “saved the appearances.” Galileo’s radical departure was to
suggest that what had previously been taken to be mere hypotheses and conventions were in fact ultimate truth, that a formula
which saved the appearances was the truth (Barfield 1957). This liberation of the role of measurement was the beginning of
modern science.
Galileo’s contribution was to see the world itself as an idealized, mathematical manifold that could be measured (Russell 2007,
p.184). There is a subtle confounding in this view, however. The mathematical world is different from the sensory,
phenomenal, and interpersonal world. When we talk and think about the latter, we use what Husserl (1962, pp.190-191)
called morphological essences, or the ideas of everyday things, like: the dry fly, wine, a good steak, and a pear-shaped object.
But mathematics traffics in ideal or exact essences, limiting concepts which actual things can only approximate and never
achieve: a straight line, a perfect circle, or true justice. Exact essences are mathematical or true by definition, but
morphological essences are not. Galileo saw the world as built of exact essences and therefore capable of mathematical
manipulation and measurement, even though he was speaking of the sensory world or the world of human experience, which
most of us would have put in the morphological camp.
Following this start, science has worked with great success to describe phenomena in progressively more exact terms so they
can be measured, analyzed statistically, and manipulated in uniform ways. This entails a leap from the morphological (the
everyday, the typical) to the exact, a leap which Husserl called “idealization” (1970, pp.24-28). This sort of leap can never
fully capture the phenomenon being studied, because the phenomenon itself is not exact. There are no perfect circles in
phenomenal reality nor true justice in the courts. The world we experience has “nothing of geometrical idealities, no
geometrical space nor mathematical time with all their shapes” (p.50). In practical terms, every mathematical equation or
description substitutes an ideality for a reality, an exact idea for a morphological one. When we consider the common
example of reducing pain to a one-to-ten scale, we can see that a very great deal of information and detail is sacrificed in this
process of idealization, but that is unavoidable in order to achieve an exact science.
The world of science is not what is given to first-person experience. As mentioned, the latter is too variable and idiosyncratic
to be scientific. In daily life we are not only aware of what we experience and how that can change, sometimes quickly, but
also that others have their own experiences, which differ somewhat from ours. From this discovery we do not then assume
that there are many worlds. Rather, we believe “in the world, whose things only appear to us differently but are the same”
(Husserl 1970, p.23). Science comes from trading off the contingencies of the actual object or experience for a description
which will allow mathematical manipulation. The sciences do not, therefore, offer an exact description but, rather, a
description using exact terms (Russell 2007, p.186).
Scientifically, the world of first-person experience is nothing more than a veil of appearances behind which the objective world
available to science lies hidden; the “true world” must therefore be understood to be the world of science, increasingly
separated from the world of first-person experience (Russell 2007, p.191; also Philipse 1995, p.296). This is precisely the
reverse of what is the case, though. The mathematical world of scientific idealities has been substituted for “the only real
world, the one that is actually given through perception that is ever experienced and experienceable–our everyday life-world”
(Husserl 1970, pp.48-49).
When life as it is lived gets idealized or compressed into a mathematically manipulable form, too much is lost to be accurate
any longer. The phenomenon is no longer what we experience and becomes caricatured and distorted. This is the case
with psychotherapy, with what it means to be depressed or paranoid, and, quite possibly, with how psychoactive drugs affect
us. Such events can be studied to a point by reducing them to an idealized form, though only to a point. What can be
achieved is a general and undetailed version of the thing, but some other form of study will be needed to take us further.
Psychotherapy or depression or the effects of Prozac call for some form of study that allows them to retain their complexity and
will allow subjects who later read accounts of these phenomena to say that, yes, the account is much how it was for them.
This is never more true than when we are trying to study mental illness, a phenomenon that cannot ever be fully grasped by
reducing it to language available to the public domain. The process of mental illness is one which takes the sufferer
progressively away from what Merleau-Ponty called the common property world, which means that what belongs to mental
illness is even less reducible (to third-person terms) than most human experience–which is scarcely exact or specific even when
it is not troubled. Experience more resembles a pebble thrown into water, sending out rings and waves than it resembles a
geometric point. Experience and the language we use to express and refine it tend to wobble (quite) a bit, largely because
even a single thing is not only as we experience it, but there is always more there than what we see (Husserl 1977, p.45). The
same words and, for that matter, the same experiences are not reliably the same from time to time or situation to situation. It
can require time and shared experiences, patience, and quite a bit of intuition and discipline to grasp in detail what another
person struggles with.
That is, of course, the world psychotherapy, assessment, and the somatic treatments try to inhabit. These will always be
messy processes, and they will never be reducible to a satisfactory degree to something measurable because human experience
is by nature indefinite.
But then again, experience does not have to be reduced to public domain language in order to be communicated. Conscious
experience is available to us when we want to speak about it. Even the most private of experiences–physical pain–can be
spoken of meaningfully, and we know what the other person means when he or she refers to it. For that matter, newborn
infants know their mothers’ moods as soon as they step into the room. How hard can it be?
6. The question of ethics
At this point, it is not at all clear what evidence-based practice could possibly mean in the mental health fields. The short
answer, then, to the question of ethics is that evidence-based treatment is not ethically required because it does not exist.
There is a larger issue, though. My impression on the basis of what I have written and found in the process of researching this
paper is that evidence-based treatment is a corrupted field and that it is unethical to apply so-called evidence-based methods to
clinical practice. The drumbeat for evidence-based practice is either a power play, mindless conformity, or a scuttling toward
safety. The Ethical Standards by which psychologists operate–and these must be very like those in other mental health
disciplines–essentially require that we know what we are doing and not use discredited methods. Evidence-based treatment
lacks internal and external validity and therefore lacks credibility. Relying on it is much the same as insisting on something
for either political reasons or for no reasons at all.
We need to rethink this whole affair and start over.
Notes
1. This definition of evidence puts all of the emphasis on the status of e and not on the truth or even the reasonableness of h.
That is the thrust of John Stuart Mill’s (1959, chapter 18) position that probability is a measure of rational credibility when the
conditions under which it is reasonable to believe are in dispute. This position entails that a probability statement indicates
the degree of reasonableness of believing a hypothesis, not the degree to which the hypothesis itself is reasonable. If
probability meant the latter, this would indicate we already had some degree of belief in the hypothesis, which in turn means we
have a belief in the hypothesis, and the so-called evidence would be superfluous because it adds nothing or is at best “slight
evidence.” The point is that reasonableness of belief is a threshold concept. Therefore e cannot be evidence of a
hypothesis we have already decided it is reasonable to believe. Rather, e can only tell us the degree of that reasonableness,
which is not a threshold concept.
If we translate Mill’s argument into the language found in null hypothesis significance testing (NHST), we can notice a common
error in the interpretation of research results. His position would mean this: The significance level set by the researcher (e.g., .05) caps
the probability of obtaining data at least as extreme as those observed if the null hypothesis (of no difference) is true. The p-value thus
represents how reasonable it is to take e as evidence, in other words, and says nothing about the credibility of the hypothesis being studied.
Unfortunately, the p-value is commonly misconstrued as the probability that the researcher’s hypothesis is true (in this case a hopeful
95%).
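To make the distinction concrete, the following sketch uses an invented example (15 heads in 20 tosses of a possibly biased coin) together with an invented prior and alternative; none of these numbers come from the sources cited.

# Illustration of Note 1 with invented numbers: the p-value is the probability of data
# at least this extreme given the null hypothesis, not the probability of a hypothesis given the data.

from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

heads, tosses = 15, 20

# One-sided p-value: P(15 or more heads | the coin is fair) -- about 0.021
p_value = sum(binom_pmf(k, tosses, 0.5) for k in range(heads, tosses + 1))
print(round(p_value, 3))

# Saying anything about the probability that the coin is fair requires assumptions the
# p-value does not contain: here, an invented prior of 0.5 and an invented alternative (p = 0.75).
prior_fair = prior_biased = 0.5
likelihood_fair = binom_pmf(heads, tosses, 0.5)
likelihood_biased = binom_pmf(heads, tosses, 0.75)
posterior_fair = (likelihood_fair * prior_fair) / (
    likelihood_fair * prior_fair + likelihood_biased * prior_biased
)
print(round(posterior_fair, 3))   # about 0.068, which is neither the p-value nor 1 minus it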
References
Achinstein, P. (2001). The book of evidence. New York: Oxford University Press.
Achinstein, P. (1991). Particles and waves: Historical essays in the philosophy of science. New
York: Oxford University Press.
Achinstein, P. (1983). The nature of explanation. New York: Oxford University Press.
Angell, M. (2009). Drug companies and doctors: A story of corruption. The New York Review of Books, 56: January 15.
Angell, M. (2011). The epidemic of mental illness: Why? The New York Review of Books, 58: June 23.
Bakker, M., Cramer, A., Matzke, D., Kietvit, R., van der Maas, H., Wagenmakers, E., & Borsboom, D. (2013). Dwelling on the
past. European Journal of Personality, 27:120-121.
Bakker, M., van Dijk, A., & Wicherts, J. (2012). The rules of the game called psychological science. Perspectives on Psychological
Science, 7:543-554.
Bakker, M. & Wicherts, J. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods,
43:666-678.
Barfield, O. (1957). Saving the appearances: A study in idolatry. London: Faber & Faber.
Bartlett, R., Roloff, D., & Andrews, A. (1985). Extracorporeal circulation in neonatal respiratory failure: A prospective
randomized study. Pediatrics 76:479-487.
Begley, C. & Ellis, L. (2012). Raise standards for preclinical cancer research. Nature, 483:531-533.
Benson, K. & Hartz, A. (2000). A comparison of observational studies and randomized controlled trials. New England
Journal of Medicine, 342:1878-1886.
Bogen, J. (2010). Noise in the world. Philosophy of Science 77:778-791.
Bogen, J. (2011). “Saving the phenomena” and saving the phenomena. Synthese 182:7-22.
Brody, B., Leon, A., & Kocsis, J. (2011). Antidepressant clinical trials and subject recruitment: Just who are symptomatic
volunteers? American Journal of Psychiatry, 168:1245-1247.
Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E., & Munafo, M. (2013). Power failure: Why small sample
size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14:365-376.
Byar, D. (1976). Randomized clinical trials: Perspectives on some recent ideas. New England Journal of Medicine, 295:74-80.
Cacioppo, J. & Cacioppo, S. (2013). Minimal replicability, generalizability, and scientific advances in psychological science.
European Journal of Personality, 27:121-122.
Cohen, J. (1990). The earth is round (p<.05). American Psychologist 49:997-1003.
Concato, J., Shah, N., & Horowitz, R. (2000). Randomized controlled trials, observational studies, and the hierarchy of
research designs. New England Journal of Medicine, 342:1887-1892.
Cook, T. & Campbell, D. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.
Davidoff, F. (1999). In the teeth of evidence: The curious case of evidence-based medicine. Mount Sinai Journal of Medicine,
66:75-83.
Elliott, R., Watson, J., Greenberg, L., Timulak, L., & Freire, E. (2013). Research on humanistic-experiential psychotherapies.
In M. Lambert (Ed.), Bergin & Garfield’s handbook of psychotherapy and behavior change (6th ed., pp.495-538).
New York: Wiley & Sons.
Evidence-Based Medicine Working Group (1992). Evidence-based medicine: A new approach to teaching the practice of
medicine. Journal of the American Medical Association, 268:2420-2425.
Fisher, R. (1947). The design of experiments (fourth edition). Edinburgh: Oliver & Boyd.
Giere, R. (1979). Understanding scientific reasoning. New York: Holt, Rinehart & Winston.
Gupta, M. (2014). Is evidence-based psychiatry ethical? Oxford: Oxford University Press.
Guyatt, G., Rennie, D., Meade, M. O., & Cook, D. J. (2008). Users’ guides to the medical literature: A manual for evidence-based
practice. 2nd Edition. New York: McGraw-Hill.
Haig, B. (2014). Investigating the psychological world: Scientific method in the behavioral sciences. Cambridge: The MIT
Press.
Haynes, R. B. (2002). What kind of evidence is it that evidence-based medicine advocates want health care providers and
consumers to pay attention to? BMC Health Services Research 2(3).
Hesse, M. (1974). The structure of scientific inference. Berkeley: University of California Press.
Husserl, E. (1962). Ideas: General introduction to pure phenomenology. (Trans. W. R. Gibson). New York: Collier Books.
Husserl, E. (1970). The crisis of European sciences and transcendental phenomenology. (Trans. D. Carr). Evanston: Northwestern
University Press.
Husserl, E. (1977). Phenomenological psychology: Lectures, summer semester 1925. (Trans. J. Scanlon). The Hague:
Martinus Nijhoff.
LaCaze, A. (2009). Evidence-based medicine must be . . ., Journal of Medicine and Philosophy,
34:509-527.
LaCaze, A. (2011). The role of basic science in evidence-based medicine. Biology and Philosophy, 26:81-98.
Laudan, L. (1981). Science and hypothesis: Historical essays on scientific methodology. Dordrecht: Reidel.
Lambert, M. (2013). The efficacy and effectiveness of psychotherapy. In M. Lambert (Ed.), Bergin and Garfield’s handbook of
psychotherapy and behavior change (sixth edition), pp.69-218. New York: Wiley.
Lexchin, J., Bero, L., Djulbegovic, B., & Clark, O. (2003). Pharmaceutical industry sponsorship and research outcome and
quality: systematic review, British Medical Journal, 326:1167-70.
Lilienfeld, S. (2007). Psychological treatments that cause harm. Perspectives on Psychological Science, 2:53-70.
Mill, J. S. (1959). A system of logic. London: George Routledge & Sons.
Mohr, D., Spring, B., Freedland, K., Beckner, V., Arean, P., Hollon, S., Ockene, J., & Kaplan, R. (2009). The selection and design of
control conditions for randomized controlled trials of psychological interventions. Psychotherapy and
Psychosomatics, 78:275-284.
Morrison, S. (2007). McMaster breakthrough ranks as a top medical milestone, Daily News, 9 (January).
Newton, I. (1946). Newton’s mathematical principles of natural philosophy and his system of the world: Newton’s Principia.
(Motte, A., trans., & Cajori, F., ed.). Berkeley: University of California Press.
Open Peer Commentary (2013). European Journal of Personality, 27:120-144.
Perlis, R., Perlis, C., Wu, Y., Hwang, C., Joseph, M. & Nierenberg, A. (2005). Industry sponsorship and financial conflict of
interest in the reporting of clinical trials in psychiatry, American Journal of Psychiatry, 162:1957-60.
Philipse, H. (1995). Transcendental idealism. In B. Smith & D. W. Smith (Eds.), The Cambridge companion to Husserl.
Cambridge: Cambridge University Press.
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug
targets? Nature Reviews Drug Discovery, 10:712-713.
Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and Quasi-experimental designs for generalized causal
inference. Boston: Houghton Mifflin.
Spielmans, G., Pasek, L., & McFall, J. (2007). What are the active ingredients in cognitive and behavioral psychotherapy for
anxious and depressed children? A meta-analytic review. Clinical Psychology Review, 27:642-654.
Stiles, W., Hill, C., & Elliott, R. (2015). Looking both ways. Psychotherapy Research, 25:282-293.
Straus, S. E. & McAlister, F. A. (2000). Evidence-based medicine: A commentary on common criticisms, Canadian Medical
Association Journal, 163:837-841.
Straus, S. E., Richardson, W. S., Glasziou, P., Haynes, R. B. (2005). Evidence-based medicine: How to practice and teach. (3d
edition) London: Elsevier Churchill Livingstone.
Strawson, G. (2008). Real materialism. Oxford: Clarendon Press.
Swinburne, R. (1973). An introduction to confirmation theory. London: Oxford University Press.
Szatmari, P. (2003). The art of evidence-based child psychiatry. Evidence-Based Mental Health, 6:1-3.
Turner, E., Matthews, A., Linardatos, E., Tell, R., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its
influence on apparent efficacy. New England Journal of Medicine, 358: 252-260.
Tullock, G. (1959). Publication decisions and tests of significance: A comment. Journal of the American Statistical
Association, 54:593.
Woodward, J. (2010). Data, phenomena, signal, and noise. Philosophy of Science, 77:792-803.
Woodward, J. (2011). Data and phenomena: A restatement and defense. Synthese 182:165-179.
Worrall, J. (2002). What evidence in evidence-based medicine? In J. Reiss (Ed.), Causality: Metaphysics and methods, Technical
Report 01/03. London: London School of Economics.