Realist Synthesis: Supplementary reading 6

Digging for nuggets: how 'bad' research can yield 'good' evidence

Abstract

A good systematic review is often likened to the pre-flight instrument check that ensures a plane is airworthy before take-off. By analogy, research synthesis follows a disciplined, formalised, transparent and highly routinised sequence of steps so that its findings can be considered trustworthy before being launched on the policy community. The most characteristic aspect of that schedule is the appraise-then-analyse sequence. The research quality of the primary studies is checked and only those deemed to be of a high standard may enter the analysis, the remainder being discarded. This paper rejects this logic, arguing that the 'study' is not the appropriate unit of analysis for quality appraisal in research synthesis. There are often nuggets of wisdom in methodologically weak studies, and systematic review disregards them at its peril. Two evaluations of youth mentoring programmes are appraised at length. A catalogue of doubts is raised about their design and analysis. Their conclusions, which incidentally run counter to each other, are highly questionable. Yet there is a great deal to be learned about the efficacy of mentoring if one digs into the specifics of each study. 'Bad' research may yield 'good' evidence, but only if the reviewer follows an approach which involves analysis-and-appraisal.

Introduction

Interest in the issue of 'research quality' is at an all-time high. Undoubtedly, one of the key spurs to the quest for higher standards in social research is the evidence-based policy movement. The chosen instrument for figuring out best practice for forthcoming interventions in a particular policy domain is the systematic review of all first-rate evidence from bygone studies in that realm. In trying to piece together the evidence that should carry weight in policy formation, a key step in the logic is to provide an 'inclusion criterion' as a means of identifying those studies upon which most reliance should be placed.

This paper questions this strategy, beginning with a short rehearsal of the role of quality appraisal in systematic review. This brief history commences with a backward glance at the short-sighted world of meta-analytic reviews, in which double-blinded, randomised controlled trials are deemed to provide the source of gold-standard evidence. It then moves to present-day attempts to provide parallel appraisal tools for the full range of social research strategies. In particular, the difficulties of using quality frameworks to appraise qualitative research are discussed.

One particular impediment is highlighted in the focal section of the paper. My basic hypothesis is that the appraisal tool has to work in parallel with the nature of the synthesis. If, as in meta-analysis, that synthesis is arithmetic and has no other objective than to calculate the mean effect of a class of programmes, then indeed there is no need to look beyond RCTs as the source of primary evidence. If, however, the ambition is to provide an explanatory synthesis, then the appraisal tool should be subordinate to the particular explanation being pursued. The wide-ranging, whole-study nature of qualitative appraisal tools covers aspects of the primary inquiries that may be quite irrelevant to the explanatory thesis pursued in a review.
The consequence of wielding such a generic quality axe is that credible explanatory messages from otherwise poor studies are lost to the review. The final section of the paper forwards an alternative approach to research appraisal, considering how to make the most of mixed messages from curate's eggs. Two qualitative studies from the field of mentoring research are used as illustrations.

Quality appraisal in systematic review

Systematic reviews carry supreme influence in the world of evidence-based policy, so argue the disciples, because they are ruthlessly methodical. A set of steps, known as a 'protocol', is established through which all reviews have to pass, and in the passing they are deemed to produce a cumulative and objective body of knowledge. I reproduce a template of the classic meta-analytic review in Figure 1. There are any number of such operational diagrams and flowcharts in the literature. Some specify six stages, some seven or more. Some include a preliminary feasibility study. Some include a planning stage. Some make room for periodic updating as new primary studies continue to trickle in. My rather more humble effort below (based on Alderson et al, 2003 and CRD, 2001) is only intended to capture the essentials:

Figure 1: Simplified systematic review template

1. Formulating the review question. Identifying the exact hypothesis to be tested about the efficacy of a particular class of interventions.
2. Identifying and collecting the evidence. Searching for and retrieving all the relevant primary studies of the intervention in question. Comprehensive probing of data-banks, bibliographies and websites.
3. Appraising the quality of the evidence. Deciding which of the foregathered studies are valid and serviceable for further analysis by distinguishing rigorous from flawed primary studies.
4. Extracting and processing the data. Presenting the raw evidence on a grid, gathering the relevant information from all primary inquiries.
5. Synthesising the data. Collating and summarising the evidence in order to address the review hypothesis. Using statistical methods to estimate the mean effect of a class of programmes.
6. Disseminating the findings. Reporting the results to the wider policy community in order to influence new programming decisions. Identifying best practice and eliminating dangerous interventions.

Our interest here lies, of course, in step 3 – the 'critical appraisal' or 'quality threshold' stage in which the primary studies are lined up and inspected for their rigour and trustworthiness. The implication here is that 'evidence' comes in all sorts of shapes and sizes, with some of it being distinctly sloppy and untrustworthy. This sentiment is rooted in evidence-based medicine and is fixed upon its bête noire, namely the 'expert opinions' of doctors and clinicians.

"The publication of Archie Cochrane's radical critique, Efficiency and Effectiveness, 25 years ago stimulated a penetrating examination of the degree to which medical practice is based on robust demonstrations of clinical effectiveness... Instead of practice being dominated by opinion (possibly ill-informed) and by consensus formed in poorly understood ways by 'experts', the idea is to shift the centre of gravity of health care decision making towards an explicit consideration and incorporation of research evidence." (Sheldon, 1997)

Here is the call for a methodological tribunal at the centre of systematic reviews.
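Before turning to that tribunal's verdicts, it is worth making concrete what the 'arithmetic' of step 5 in Figure 1 amounts to. The following is a minimal sketch of the standard fixed-effect, inverse-variance calculation of a mean effect; the notation is my own illustration rather than anything drawn from the handbooks cited above:

$$\bar{T} \;=\; \frac{\sum_{i=1}^{k} w_i T_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{v_i}$$

where $T_i$ is the net effect reported by primary study $i$, $v_i$ is its variance and $k$ is the number of studies surviving the quality threshold at step 3. The point to note is that each primary study enters the synthesis as a single number, so a whole-study, pass/fail appraisal is precisely the form of quality control this calculation requires.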
The core idea is to exclude entirely from further analysis any research that falls short of acceptable scientific standards. There are a number of such 'hierarchies of evidence' designed for evidence-based policy, which use a variety of rankings and subgroupings. Figure 2, based on Davies et al (2000) and CRD (2001), serves as a rough amalgam, illustrating the standing of different strategies as perceived by the 'rigorous paradigm':

Figure 2: Simplified structure of the hierarchy of evidence in meta-analysis

Level 1: Randomised controlled trials (with concealed allocation)
Level 2: Quasi-experimental studies (using matching)
Level 3: Before-and-after comparisons
Level 4: Cross-sectional, random sample studies
Level 5: Process evaluation, formative studies and action research
Level 6: Qualitative case study and ethnographic research
Level 7: Descriptive guides and examples of good practice
Level 8: Professional and expert opinion
Level 9: User opinion

The qualitative turn

Needless to say, such a pecking order has proved highly controversial. It is redundant in many areas of policy evaluation in which RCTs are not feasible, and it is a strange 'gold standard' that is quite unable to peer inside the 'black box' of programme implementation. Methodological fisticuffs of this sort are not my topic on this occasion, so it suffices to invoke the stock criticism here, which is that such a hierarchy undervalues the contribution made by other research perspectives. What should be happening, choruses the opposition, is that the pecking order should be replaced by a horses-for-courses approach, which recognises equally the contribution made to programme understanding by other research strategies. Let me quote an example of this sentiment, with no further embellishment other than to encourage the reader into noting the name of its author:

"Qualitative knowledge is absolutely essential as a prerequisite foundation for quantification in any science. Without competence at the qualitative level, one's computer printout is misleading or meaningless. We failed in our thinking about programme evaluation methods to emphasise the need for a qualitative context that could be depended upon… To rule out plausible hypotheses we need situation specific wisdom. The lack of this knowledge (whether it be called ethnography or program history or gossip) makes us incompetent estimators of programme impacts, turning out conclusions that are not only wrong, but often wrong in socially destructive ways." (Campbell, 1984)

Campbell's entreaty here has been taken up enthusiastically, if not always by members of the evidence-based policy Collaboration named in his honour, many of whom remain die-hard supporters of the RCT (Farrington and Welsh, 2001). And in respect of critical appraisal, the consequence is that there are now several dozen attempts to set down frameworks and tools to assess the quality of qualitative research.[1] I bring my brief history up to date by describing, arguably, the most significant of these – namely 'Quality in Qualitative Evaluation', a report produced for the UK Cabinet Office (Spencer et al, 2003). What hits one in the eye about this particular appraisal checklist is the sheer scale of its reach; to be sure, it is a 'full kit inspection'. The key reform is to inculcate standards for a much greater range of investigatory activities. In the classic meta-analytic reviews the quality assessment focus goes little beyond design issues (i.e. is it a double-blinded RCT?)
as the telling feature of the original studies. The new standards establishment aims to cover all phases of the research cycle. And in many ways this is an entirely reasonable expectation. There is no pronounced emphasis in qualitative research on design. In these circles it is no disgrace to suck-it-and-see, with terms like 'unstructured', 'flexible' and 'adaptive' being the watchwords of a good design. Accordingly, good qualitative research is widely recognised as multi-faceted (or 'organic' or 'holistic') and the Cabinet Office framework responds by having something to say under 8 major headings on the following aspects of research: 'findings', 'sample', 'data collection', 'analysis', 'reporting', 'reflexivity and neutrality', 'ethics' and 'auditability'. Each of these features is then subdivided so that, for instance, in considering the quality of 'findings' the reviewer is expected to gauge their 'credibility', their 'knowledge extension', their 'delivery on objectives', their 'wider inferences' and the 'basis of evaluative appraisal'. The original subdivision into the 8 research stages thus jumps to 18 major themes or 'appraisal questions'. I will not list them all here because for each of these questions there are then 'quality indicators', usually running to four or five per theme. This leaves us with a final set of 75 indicators with which to judge the quality of qualitative research. Again, I refrain from attempting a listing, though it is worth reproducing a couple of items (below) for they illustrate the intended mode of application. That is to say, the indicators are not decision points (is it an RCT or not?). Rather they invite the appraiser to examine rather more complex propositions as 'possible features for consideration', as for example:

Is there discussion of access and methods of approach and how these might have affected participation/coverage? (a 'sampling' indicator)

Is there a clear conceptual link between analytic commentary and presentation of original data? (a 'reporting' indicator)

[1] Alas, it is impossible to appraise all of these appraisal tools here. For example, another candidate for inspection might be the approach used by the EPPI group, on which there has already been a ferocious barrage of opinion and counter-opinion (Oakley 2003; MacLure 2005). I pinpoint the Cabinet Office study for its provenance, because it is a distillation of many previous schemas and, above all, because it is the most clearly, formally and openly articulated.

As if all this impeccable detail were not enough, Spencer et al's report concludes with some self-reflection on possible omissions from the framework (such as insufficient coverage of ethics and the failure to include different sub-types of qualitative research such as discourse analysis). To summarise, however, one can say that the final product is remarkably comprehensive, covering at least the A-to-Y of qualitative inquiry.

Using the instrument

But now the point is reached for my own critical appraisal. The question I want to concentrate on is the utility of the instrument. Can such a tool be used to support systematic review? Can it act as a quality filter with which to appraise primary studies? Such an expectation undoubtedly fed into the commissioning of the framework. The underlying ambition is to draw a firm parallel with the Cochrane model.
In order to upgrade the profile of qualitative research in the policy process, the strategy is to provide a rigorous and defensible 'inclusion criterion' as a means of sorting wheat from chaff in a paradigm that feels exposed to charges of subjectivity and bias. I begin my critique by briefly rehearsing five practical impediments to using such an instrument as a quality threshold. My initial aim is to air some of the ongoing discussion about the difficulties of arriving at any quality checklist for qualitative research. In making these points, I do not want to imply that there is any naiveté on this score on the part of the team that devised the model under discussion; indeed they stress its status as a 'framework' and call for 'field testing' in order to perfect it for usage (Spencer et al 2003: 107). The following quintet of pitfalls anticipates some of the grave practical difficulties inherent in qualitative quality assessment. They should, however, be regarded as a preliminary to a sixth and underlying critique, which argues that the very idea of using generic quality frameworks misunderstands the essential nature of research synthesis.

I. Boundless Standards. Broadening the domain of quality standards creates quality registers that are dinosaurian in proportion. Spencer et al's assessment grid looks rather more like a scholarly overview of methodological requirements than a practicable checklist. A reviewer might well have to put hundreds of primary studies under the microscope. It takes little imagination to see that wading through the totality of evidence in the light of seventy-five general queries (rather than a pointed one on the design employed) can render the exercise unmanageable.

II. Abstract Standards. Broadening the domain of quality standards results in the usage of 'essentially contested concepts' (Gallie, 1964) to describe the requisite rules. Amongst the 'weasel words' that find their way into the Cabinet Office criteria are the requirements that the research should have 'clarity', 'coherence' and 'thoughtfulness', and that it should be 'structured', 'illuminating' and so forth. By contrast, it is fairly easy to decipher whether or not a study has utilised an RCT and thus deliver a pass/fail verdict on its quality. But a concern for rigour, clarity, sensitivity and so forth generates far, far tougher calls.

III. Imperceptible Standards. Broadening the domain of quality standards exacerbates one of the standard predicaments of systematic review, namely the foreshortening of research reportage caused by publishing and reporting conventions. Inevitably, the first victim of 'word-length' syndrome, especially in journal formats, is the technical detail on access, design, collection, analysis and so forth. The consequence, of course, is that an appraisal of research standards cast in terms of 75 methodological yardsticks will frequently have no material with which to work.

IV. Composite Standards. Broadening the domain of quality standards also raises novel questions about their balance. Should 'careful exposition of a research hypothesis' be prized more than 'discussion of fieldwork setting'? And how do these weigh up alongside 'clarity and coherence of reporting'? The permutations, of course, increase exponentially when one is faced with quality indicators by the score. By and large the new standards regime has resisted formulating an algorithm to calculate the relative importance of different contributions (Spencer et al 2003: 82).
Indeed the tendency is to resist altogether the 'marking scheme' approach implicit in the orthodox hierarchies illustrated in Figure 2.

V. Permissive Standards. With all of the above problems squarely in mind, most recent standards compendia come with provisos stressing that their application requires 'judgement'. The following, from the Cabinet Office report, is a prime example:

'We recognise that there will be debate and disagreement about the decisions we have made in shaping the structure, focus and content of the framework. The importance of judgement and discretion in the assessment of quality is strongly emphasised by authors of other frameworks, and it was underlined by participants in our interviews and workshops. We think it is critical that the framework is applied flexibly, and not rigidly or prescriptively: judgement will remain at the heart of assessments of quality.' (Spencer et al 2003: 91)

One would be hard put to find a more sensible statement on the sensitivity needed in quality appraisal. But, for some, such a twist heralds the arrival into research synthesis of a decidedly oxymoronic character, the 'permissive standard'. Whichever view one holds, it is clear that the application of quality standards is moving further and further away from being an efficient and unproblematic preliminary to systematic review.

VI. Goal-free Standards. Broadening the domain of quality standards loses sight of their function within a review. With this little proposition, I arrive at my main critique and so linger longer on this point. As we have seen, there is a self-defeating element in these efforts to build a quality yardstick for qualitative research. The more rigorous the exploration of the conduct of qualitative inquiry, the more unwieldy becomes the appraisal apparatus. The reason is obvious. The standards echo the very nature of qualitative explanation. They capture the way that ethnographic accounts are created. Qualitative explanation has been portrayed in many ways, some of the best-known labels being: 'thick description', 'pattern explanation', 'intensive analysis', 'explanatory narratives' and 'ideographic explanation'. These are different ways of getting at the same thing, namely that qualitative inquiry works by building up many-sided descriptions, the explanatory import of which depends on their overall coherence. Such holistic explanations are, moreover, developed and refined in the field, over time and with the assistance of another set of analytic processes such as 'reflexivity', 'respondent validation', 'triangulation' and 'analytic induction'. Little wonder, then, that the evaluative tools needed to check out all this activity have grown exponentially. The end product is a set of desiderata rather than exact rules, a cloak of ambitions rather than a suite of performance indicators, a paradigm gazetteer rather than a decision-making tool, a collection of tribal nostrums rather than critical questions, a methodological charter rather than a quality index.

So how can we get qualitative quality appraisal to work? The answer resides in neither technical trick nor quick fix. There is no point in achieving this fine-grained appreciation of the multi-textured nature of qualitative research only to ditch it via the production of an abridged instrument cropping the 75 carefully identified issues to, say, a rough-and-ready 7. One needs to go back to square one to appreciate the problem.
And the culprit here is that the ambition to create a quality appraisal tool (and one, moreover, that matches the muscularity of the quantitative hierarchies) has run ahead of consideration of its function within a systematic review. Whatever the oversimplifications of the RCT-or-bust approach, it does produce a batch of primary studies up and ready to deliver a consignment of net effects that can be synthesised readily into an overall mean effect. Form fits function. The qualitative quality appraisal tools, by contrast, are functionless; they are generic. Accordingly, the prior question regarding qualitative assessment is: what do we expect the synthesis of qualitative inquiries to deliver? Tougher still, what is the expected outcome of the synthesis of an array of multi-method studies?

An initial temptation might be to remain true to the 'additive' vision of meta-analysis – but a moment's thought tells us there is no equivalent of adding and taking the mean in qualitative inquiry. In general, a model of synthesis-as-agglomeration seems doomed to failure. Any single qualitative inquiry, any one-off case study, produces a mass of evidence. As we have seen, one of the marks of good research of this ilk is the ability to fill out the pattern, to capture the totality of stakeholders' views, to produce thick description. But these are not the qualities that we aspire to in synthesis. We do not want an endless narrative; we do not seek a compendium of viewpoints; we do not crave thicker description! Here lurks a paradox: the yardsticks of good qualitative research do not correspond to the hallmarks of good synthesis (and, more especially, the format of practicable policy advice). And with this thought, we arrive at the negative conclusion to the paper. Since the synthetic product is never going to be composed holistically, the full-kit inspection of each component study is not only unwieldy but also quite unnecessary.

Digging for nuggets in systematic review

In this section I attempt to solve the paradox. Since I have argued that the strategy for quality appraisal is subordinate to the objective of a review, I commence with a brief description of an alternative view of how evidence can be brought together to inform policy. This is no place to introduce a paradigm shift in systematic review, so I refer readers to fuller accounts in Pawson et al (2004) and Pawson (2006). The realist perspective views research synthesis as a process of theory testing and refinement. Policies and programmes are rooted in ideas about why they will solve social problems, why they will change behaviour, why they will work. Realist synthesis thus starts by articulating key assumptions underlying interventions (known as 'programme theories') and then uses the existing research as case studies with which to test those theories.

The overall expectation about how evidence will shape up is as follows. Some studies may indicate that a particular intervention works and, if they have a strong qualitative component, they may be able to say why the underlying theory works. Since there are no panacea programmes on this earth, other studies will chart instances of intervention failure and may also be able to describe a thing or two about why this is the case. Synthesis, in such instances, is not a case of taking averages or combining descriptions but rests on explanatory conciliation. That is to say, a refined theory emerges, better able to explicate the scope of the original programme theory.
The aim of working through the primary research is, in short, to provide a more subtle portrait of intervention success and failure. The strategy is to provide a comprehensive explanation of the subjects, circumstances and respects in which a programme theory works (and in which it fails). And that, in a nutshell, is the basic logic of realist synthesis. Much more could be said about how the preliminary theories are chosen and expressed, how primary studies are located and selected, how evidence is extracted and compared, and so on. But this paper is directed at research quality and it is to some rather uncompromising ramifications on this score that I return. The quality issue is transformed in two ways, both due to the fact that programme theories are the focus of synthesis.

I. The whole study is not the appropriate unit of quality appraisal. Primary studies are unlikely to have been constructed with an exploration of a particular programme theory as their raison d'être. More probably, the extant research will have been conducted across a multiplicity of banners under which evaluation research and policy analysis are organised. However, in so far as they will have a common commitment to understanding an intervention, few of these investigations will have absolutely nothing to say about why programmes work. In the case of qualitative research, there is a reasonable expectation that key programme theories will get an airing – alongside lots of other material on location, stakeholders, meanings, negotiations, power plays, implementation hitches, and so on. This raises a completely revised expectation about research synthesis, namely that evidential fragments rather than entire studies should be the unit of analysis. And in terms of research quality there is a parallel transformation down to the level of the specific proposition. Because synthesis takes a specific analytic cut through the primary studies, it is not a sensible requirement that every one of the many-sided claims in qualitative research must be defensible. What must be secure, however, are those elements that are to be put to use. In short, the implication for research quality goes back to the title of this paper and the idea that an otherwise mediocre study can indeed produce pearls of explanatory wisdom.

II. Research quality can only be determined within the act of synthesis. The notion that research synthesis is the act of developing and refining explanations also has a profound implication for the timing of any quality assessment. Theory development is dynamic. Understanding builds throughout an inquiry. Evidential requirements thus change through time, and quality appraisal needs to be sensitive to this expectation. The notion of 'explanation-sensitive' standards will ring alarm bells in the homogenised world of meta-analysis (though it does, incidentally, clarify the somewhat oxymoronic idea of 'permissive standards' mentioned earlier). However, there is nothing alien to scientific inquiry in such a notion. The iterative relationship between theory and data is a feature of all good inquiry. All inquiry starts with understanding E1 and moves on to more nuanced explanations E2, E3, … EN, and in the course of doing so will gobble up and spit out many different kinds of evidence.
Applying this model to research synthesis introduces a different primary question for quality appraisal, namely: can this particular study (or fragment thereof) help, and is it of sufficient quality to help, in clarifying the particular explanatory challenge that the synthesis has reached? Such a question can only be answered, of course, relative to that point of analysis and, therefore, in the midst of analysis. In short, the worth of a primary study is determined in the synthesis.

Pearls of wisdom – about mentoring

This final section has the task of bringing these abstract methodological musings to life via an illustration. The example is drawn from a review I conducted on 'mentoring relationships' (Pawson, 2004). Clearly, I can do little more than give a flavour of a review that provides a hundred-page synthesis of the evidence from 25 key primary studies. What I've chosen to reproduce, therefore, is some of the material from the first two cases. These are selected because they confront the reviewer with a severe challenge from the point of view of research quality. Both are qualitative studies. Each carries, quite distinctly and unmistakably, the voice, the preferences and the politics of its author. Arguably, they represent dogma rather than data. Without doubt they would be consigned to the pile of rejects in a Cochrane or Campbell review. Even more interesting is how they would fare under the Cabinet Office quality appraisal checklist and, in particular, the criterion that there should be 'a clear conceptual link between analytic commentary and presentation of original data'. As we shall see, these studies make giant and problematic inferential leaps in passing from data to conclusions. In short, this pair of inquiries tests to its limits my thesis about looking for pearls of wisdom rather than acres of orthodoxy.

The first stage of realist synthesis is to articulate the theory that will be explored via the secondary analysis of existing studies. Thereafter, there is the hard slog of searching for and assembling the studies appropriate to this task. Eventually, we get down to the synthesis per se. In adhering to the principles developed thus far, it is clear that the reviewer has rather a lot of work to do in melding any study into an explanatory synthesis. For the purposes of this exposition, four tasks are highlighted. There has, of course, to be some initial orientation for the reader about the purpose of the primary research, its method and its conclusions. One is not, however, reviewing a study in its own terms, so the core requirement of the report is a consideration of the implications of the study for the hypothesis under review. Alongside this, according to the argument above, will be an assessment of research quality. To repeat, this is not directed at the entirety of a primary study and all of its conclusions. Rather, what is pursued is a different question about whether the original research warrants the particular inference drawn from it by the reviewer. Finally, the synthesis should take stock of how the review hypothesis has been refined in the encounter with each evidential stepping stone.

Let me commence the illustration by putting the review hypothesis in place before we assess the contribution of this brace of studies. The review examines mentoring programmes in which an experienced mentor is partnered with a younger, more junior mentee, with the idea of passing on wisdom and individual guidance to help the protégé through life's hurdles.
Such an idea has been used across policy domains and in all walks of life. The review concentrates on so-called 'engagement mentoring', that is, dealing with 'disaffected', 'high-risk' youth and helping to move them into mainstream education and training. It may help readers to orient themselves to the idea if I mention the most renowned and longstanding of these programmes, namely Big Brothers and Big Sisters of America. Much more than in any other type of social programme, interpersonal relationships between stakeholders embody the intervention. They are the resource that is intended to bring about change. Accordingly, the theory singled out for review highlighted the intended 'function' of mentoring. Youth mentoring programmes carry expectations about a range of such functions, which are summarised in Figure 3.

Figure 3: A basic typology of mentoring mechanisms

advocacy (positional resources)
coaching (aptitudinal resources)
direction setting (cognitive resources)
affective contacts (emotional resources)

('long-move' mentoring: progression upwards through all four mechanisms)

Starting at the bottom, it is apparent that some mentors see their primary role as offering the hand of friendship; they work in the affective domain, trying to make mentees feel differently about themselves. Others provide cognitive resources, offering advice and a guiding hand through the difficult choices confronting the mentee. Still others place hands on the mentees' shoulders – encouraging and coaxing their protégés into practical gains, skills and qualifications. And in the uppermost box, some mentors grab the mentees' hands, introducing them to this network, sponsoring them in that opportunity, using the institutional wherewithal at their disposal. In all cases the mentoring relationship takes root, and change begins, only if the mentee takes willingly the hand that is offered.

Put simply, the basic thesis under review was that successful engagement mentoring involves the 'long move' through all of these stages. The disaffected mentee will not suddenly leap into employment, and support within a programme must be provided to engender all of the above stages. This proposition leads to further hypotheses, tested in the review, about limitations on the individual mentor's capacity to fulfil each and all of these demanding roles. The review, in short, examines the primary evidence with a view to discovering whether this sequence of steps is indeed a requirement of successful programmes. More particularly, it interrogates the evidence in respect of the resources of the mentor and of the intervention, in order to ascertain which practitioners and which delivery arrangements are best placed to provide this extensive apparatus.

Let us now move on to the contribution of the first two primary investigations in exploring this thesis. The reader will find that each passage of synthesis is made up of the quartet of methodological tasks mentioned above (basic orientation, hypothesis testing, quality appraisal, hypothesis refinement). The aim is to advance the rudimentary conjectures about the 'long move' and its contingencies. Given that the focus of the present paper is on research quality, I also emphasise the decisions made by way of 'appraising' the studies.

Study 1. de Anda D (2001) A qualitative evaluation of a mentor program for at-risk youth. Child and Adolescent Social Work Journal 18: 97-117.

This is an evaluation of project RESCUE (Reaching Each Student's Capacity Utilizing Education).
Eighteen mentor-mentee dyads are investigated from a small, incorporated city in Los Angeles with high rates of youth and violent crime. The aims of the programme are described in classic 'long move' terms: 'The purpose of this relationship is to provide a supportive adult role model, who will encourage the youth's social and emotional development, help improve his/her academic and career motivation, expand the youth's life experiences, redirect the youth from at-risk behaviours, and foster improved self-esteem'. A curious, and far from incidental, point is that the volunteer mentors on the RESCUE programme were all firefighters.

The research takes the form of a 'qualitative evaluation' (meaning an analysis consisting of 'group interview' data and biographical 'case histories'). There is a major claim in the abstract that the mentees are shown to secure 'concrete benefits', but these are mentioned only as part of the case study narratives, there being no attempt to chart inputs, outputs and outcomes. The findings are, in the author's words, 'overwhelmingly positive'. The only hint of negativity comes in reportage of the replies to a questionnaire item about whether the mentees would like to 'change anything about the programme'. The author reports, 'All but three mentees answered the question with a "No" response'. Moreover, de Anda indicates that two of these malcontents merely wanted more 'outings' and the third, more 'communication'.

As for critical appraisal, the research could be discounted as soppy, feel-good stuff, especially as all of the key case study claims are in the researcher's voice (e.g. 'the once sullen, hostile, defensive young woman now enters the agency office with hugs for staff members, a happy disposition and open communication with adult staff members and the youth she serves in her agency position'). The case studies do, however, provide a very clear account of an unfolding sequence of mentoring mechanisms:

"Joe had been raised in a very chaotic household with his mother as the primary parent, his father's presence erratic… He was clearly heading towards greater gang involvement… He had, in fact, begun drinking (with a breakfast consisting of a beer), demonstrated little interest in school and was often truant… The Mentor Program and the Captain who became his mentor were ideal for Joe, who had earlier expressed a desire to become a firefighter. The mentor not only served as a professional role model, but provided the nurturing father figure missing from his life. Besides spending time together socially, his mentor helped him train, prepare and discipline himself for the Fire Examiners test. Joe was one of the few who passed the test (which is the same as the physical test given to firefighters). A change in attitude, perception of his life, and attitudes and life goals was evident… [further long, long story omitted]… He also enrolled at the local junior college in classes (e.g. for paramedics) to prepare for the firefighters' examination and entry into the firefighters academy. He was subsequently admitted to the fire department as a trainee."

What we have here is a pretty full account of a successful 'long move' and the application of all of the attendant mechanisms – progressing from affective contacts (emotional resources) to direction setting (cognitive resources) to coaching (aptitudinal resources) to advocacy (positional resources).
The vital evidence fragment for the review is that this particular mentor ('many years of experience training the new, young auxiliary firefighters as well as the younger Fire explorers') was quite uniquely positioned. As Joe climbs life's ladder, away from his morning beer, the Captain is able to provide all the resources needed to meet all his attitudinal, aptitudinal and training needs. How frequently such a state of affairs applies in youth mentoring is a moot point, and one acknowledged only in the final paragraph of the paper.

In its defence, one can point to two more plausible claims of the paper. There is a constant refrain about precise circumstantial triggers and points of interpersonal congruity that provide the seeds of change. 'It was at this point [end of lovingly described string of bust-ups] that Gina entered the Mentor programme and was paired with a female firefighter. The match was a perfect one in that the firefighter was seen as "tough" and was quickly able to gain Gina's confidence.' There is also an emphasis, via the life history format, on the holistic and cumulative nature of the successful encounter. 'The responses and case descriptions do provide a constellation of concrete and psychosocial factors which the participants felt contributed to their development and success.' These are the evidence fragments that are taken forward in the review and which, indeed, find further support in subsequent studies.

In short, this example encapsulates the dilemma of incorporating highly descriptive qualitative studies in a review. We are told about 'overwhelmingly positive results', all the 'evidence' is wrapped up in the author's narrative, and there is no attempt at strategies such as respondent validation. Compared to qualitative research of a higher methodological calibre, there is no material in the study allowing the reader the opportunity to 'cross-examine' its conclusions. Read at face value, it tells us that engagement mentoring works. Read critically, it screams of bias. Read synthetically, there is nothing in the account to suggest a general panacea and much to suggest a special case. The key point, however, is that some vital explanatory ingredients are unearthed (that the long move is possible given a well-positioned mentor, established community loyalties, specific interpersonal connections and interests, and a multi-objective programme) and are not lost to the review.

Study 2. Colley H (2003) Engagement mentoring for socially excluded youth: problematising an 'holistic' approach to creating employability through the transformation of habitus. British Journal of Guidance and Counselling 31(1): 77-98.

Here we transfer from American optimism to British pessimism via the use of the same research strategy. The evidence here is drawn from a study of a UK government scheme ('New Beginnings') which, in addition to basic skills training and work placement schemes, offered a modest shot of mentoring (one hour per week). This scheme is one of several in the UK mounted out of a realisation that disaffected youth have multiple, deep-seated problems and that, accordingly, 'joined-up' service provision is required to have any hope of dealing with them. Colley's study takes the form of a series of qualitative 'stories' (her term) about flashpoints within the scheme. She selects cases in which the mentor 'demonstrated an holistic person-centred commitment to put the concerns of the mentee before those of the scheme' and reports that, 'sooner or later these relationships break down'.
The following quotations provide typical extracts from 'Adrian's story':

'Adrian spoke about his experience of mentoring with evangelical fervour: "To be honest, I think anyone who's in my position with meeting people, being around people even, I think a mentor is one of the greatest things you can have… [passage omitted]. If I wouldn't have had Pat, I think I'd still have problems at home… You know, she's put my life in a whole different perspective."'

Adrian was sacked from the scheme after 13 weeks. He was placed in an office as a filing clerk and dismissed because of lateness and absence. Colley reports that, despite his profuse excuses, the staff felt he was 'swinging the lead'. Pat (the mentor) figured otherwise:

"Pat, a former personnel manager and now student teacher, was concerned that Adrian had unidentified learning difficulties that were causing him to miss work through fear of getting things wrong. She tried to advocate on his behalf with New Beginnings staff, to no avail."

At this point Adrian was removed from the scheme. Another 'story' is shown to betray an equivalent pattern, with the mentor supporting the teenage mentee's aspiration to become a mother and to eschew any interest in work (and thus the programme). From the point of view of the review theory, there is an elementary 'fit' with the idea of the difficulties entrenched in the 'long move'. Mentors are able to provide emotional support and a raising of aspirations but cannot, and to some extent will not, provide advocacy and coaching. On this particular scheme, the latter are not in the mentor's gift but are the responsibility of other New Beginnings staff (their faltering, bureaucratic efforts also being briefly described).

And what of quality appraisal? Colley displays the ethnographer's art in being able to bring to life the emotions described above. She also performs ethnographic science in the way that these sentiments are supported by apt, detailed and verbatim quotations from the key players. Compared to case study one, the empirical material might be judged as more authentically the respondent's tale rather than the researcher's account. But then we come to the author's interpretations and conclusions. On the basis of these two case illustrations, it is assumed that mentoring will inevitably be unable to reach the further goals of employability. This proposition is supported in a substantial passage of 'theorising' about the 'dialectical interplay between structure and agency', via Bourdieu's concept of 'habitus', which Colley explains as follows:

"… a structuring structure, which organises practices and the perceptions of practices, but also a structured structure: the principle of division into logical classes which organizes the perception of the social world is itself the product of internalisation of the division into social classes."

Put in more downright terms, this means that because of the way capitalist society is organised, the best this kind of kid will get is a lousy job, and whatever they do will be taken as a sign that they barely deserve that. In Colley's words, 'As the case studies illustrate, the task of altering habitus is simply unfeasible in many cases, and certainly not to a set timetable'. It is arguable that this interpretative overlay derives more from the author's self-acknowledged Marxist/feminist standpoint than from the empirical case studies presented.
There is also a further awkward methodological aspect for the reviewer in a 'relativistic' moment one often sees in qualitative work: in the introduction to her case studies, Colley acknowledges that her reading of them is only 'among many interpretations they offer'. There are huge ambiguities here, normally shoved under the carpet in a systematic review. Explanation by 'theorising' and an underlying 'constructivism' in data presentation are not the stuff of study selection and quality appraisal. Realist synthesis plays by another set of rules, which are about drawing warrantable inferences from the data presented. Thus, sticking just to Colley's case studies in this paper, they have value to an explanatory review because they exemplify in close relief some of the difficulties of 'long move' mentoring. In the stories presented, the mentor is able to make headway in terms of befriending and influencing vision, but these gains are thwarted by programme requirements on training and employment, over which the mentor had little control. What they show is the sheer difficulty that an individual mentor faces in trying to compensate for lives scarred by poverty and lack of opportunity. Whether the two instances demonstrate the 'futility' or 'unfeasibility' of trying to do so, and the unfaltering grip of capitalist habitus and social control, is a somewhat bolder inference. The jury (and the review at this point) is still out on that question.

Conclusion

These two inquiries, of course, represent only the initial skirmishes of the review. Both of them are flawed but both of them present useable evidence about the circumstances in which mentoring can and cannot flourish. Exploration of further inquiries lends detail to the developing theory, giving a picture both of the likelihood of youth mentoring programmes being able to make the 'long move' and of the additional processes and resources needed to help in its facilitation. What is discovered, of course, is a further mix of relatively successful and unsuccessful programmes. But a pattern does begin to emerge about the necessary ingredients of mentoring relationships. Goals can begin to be achieved if the mentor has a similar biography, if she shares the everyday interests of the mentee, if he has the resources to build and rebuild the relationship, and if she is able to forge links with the mentee's family, peers, community, school and college (for the full model consult Pawson, 2004).

There is one final methodological point to report from the continued journey. The additional studies included in the synthesis employed a variety of research strategies (RCT, survey, mixed-method, path analysis and, indeed, an existing meta-analysis). These tax the poor reviewer in having to make a quality appraisal across each and every one of these domains. My approach remained the same. It is not necessary to draw upon a full, formal and preformulated apparatus to make the judgement on research quality. The only feasible approach is to make the appraisal in relation to the precise usage of each fragment of evidence within the review. The worth of a study is determined in the synthesis.

References

Alderson P, Green S and Higgins J (2003) Cochrane Reviewers' Handbook. The Cochrane Library. Chichester: John Wiley.

Campbell D (1984) 'Can we be scientific in applied social science?' in Conner R, Altman D and Jackson C (eds.) Evaluation Studies Review Annual (vol. 9). Beverly Hills: Sage.
Colley H (2003) Engagement mentoring for socially excluded youth: problematising an 'holistic' approach to creating employability through the transformation of habitus. British Journal of Guidance and Counselling 31(1): 77-98.

Centre for Reviews and Dissemination (2001) Undertaking Systematic Reviews of Research on Effectiveness. CRD Report Number 4. York: University of York.

Davies H, Nutley S and Smith P (2000) What Works? Evidence-based policy and practice in public services. Bristol: Policy Press.

de Anda D (2001) A qualitative evaluation of a mentor program for at-risk youth. Child and Adolescent Social Work Journal 18: 97-117.

Farrington D and Welsh B (2001) What works in preventing crime? Systematic reviews of experimental and quasi-experimental research. The Annals of the American Academy of Political and Social Science 578: 8-13.

Gallie W (1964) 'Essentially contested concepts' in Philosophy and the Historical Understanding. London: Chatto and Windus.

MacLure M (2005) 'Clarity bordering on stupidity': where's the quality in systematic review? Journal of Education Policy 20(4) (in press).

Oakley A (2003) Research evidence, knowledge management and educational practice. London Review of Education 1: 21-33.

Pawson R (2004) 'Mentoring Relationships: An Explanatory Review'. Working Paper 21, ESRC UK Centre for Evidence Based Policy and Practice. Available at: www.evidencenetwork.org/Documents/wp21.pdf

Pawson R (2006, forthcoming) Evidence-based Policy: A Realist Perspective. London: Sage.

Pawson R, Greenhalgh T, Harvey G and Walshe K (2004) 'Realist Synthesis: An Introduction'. ESRC Research Methods Programme Paper 2. Available at: www.ccsr.ac.uk/methods/publications/RMPmethods2.pdf

Spencer L, Ritchie J, Lewis J and Dillon L (2003) Quality in Qualitative Evaluation: A Framework for Assessing Research Evidence. London: Cabinet Office.