Promoting a RealWorld and Holistic approach to Impact Evaluation
Designing Evaluations under Budget, Time, Data and Political Constraints
Workshop for staff of the MFA of Finland, Helsinki, 27-28 September 2012
Facilitated by Oumoul Ba Tall and Jim Rugh
Note: This PowerPoint presentation and the summary (condensed) chapter of the book are available at: www.RealWorldEvaluation.org 1

Introduction to Impact Evaluation (quoting NONIE Guidelines)
"In international development, impact evaluation is principally concerned with final results of interventions (programs, projects, policy measures, reforms) on the welfare of communities, households and individuals." [p. 3]
"No single method is best for addressing the variety of questions and aspects that might be part of impact evaluations. However, depending on the specific questions or objectives of a given impact evaluation, some methods have a comparative advantage over others in analyzing a particular question or objective. Particular methods or perspectives complement each other in providing a more complete 'picture' of impact." [p. xxii] 2

Introduction to Impact Evaluation (quoting NONIE Guidelines)
Three related problems that quantitative impact evaluation methods attempt to address are the following:
• The establishment of a counterfactual: What would have happened in the absence of the intervention(s)?
• The elimination of selection effects, leading to differences between the intervention (treatment) group and the control group.
• A solution for the problem of unobservables: the omission of one or more unobserved variables, leading to biased estimates. [p. 23] 3

Introduction to Impact Evaluation (quoting NONIE Guidelines)
"The safest way to avoid selection effects is a randomized selection of the intervention and control groups before the experiment starts. In a well-designed and correctly implemented Randomized Controlled Trial (RCT) a simple comparison of average outcomes in the two groups can adequately resolve the attribution problem and yield accurate estimates of the impact of the intervention on a variable of interest. By design, the only difference between the two groups was the intervention." [p. 24] 4

OK, that's enough of an initial introduction to the theory of Impact Evaluation. More references are provided at the end of this presentation for further research on IE. Let's begin to consider some implications of trying to conduct Impact Evaluations in the RealWorld. 5

Applying Impact Evaluation to this Workshop
All of you in this room have chosen to participate in this workshop. We'll call the subject matter we cover during this workshop the intervention. We presume that you are all adequately qualified to take this course. So here is what we propose to do: We will issue a pre-test to determine how well you know this subject matter. Then we will randomly choose 50% of you whom we will identify as our control group. We will ask you to leave this room and do whatever else you want to do during the next two days. Then we will ask you to come back at 4:30 pm tomorrow to take the post-test, along with those who stayed and participated in the rest of this workshop (our intervention group). Thus we will have a counterfactual against which to measure any measurable 'impact' of what we taught in the workshop. 6

Applying Impact Evaluation to this Workshop
OK, just kidding, to make a point!
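Before we move on: to make the NONIE point above concrete, here is a minimal illustrative sketch in Python (not part of the workshop materials; all data and variable names are simulated/invented) of how random assignment lets a simple comparison of average outcomes stand in for the counterfactual.

```python
# Minimal sketch (not from the workshop materials) of the NONIE point above:
# with random assignment, a simple comparison of average outcomes in the two
# groups estimates the intervention's impact. All data here are simulated.
import numpy as np

rng = np.random.default_rng(seed=42)

n = 200                                   # hypothetical eligible participants
baseline = rng.normal(50, 10, size=n)     # pre-test knowledge scores (simulated)
true_effect = 8.0                         # assumed "true" impact of the workshop

# Randomization removes selection effects: assignment is unrelated to baseline.
assign = rng.permutation(np.array([1] * (n // 2) + [0] * (n // 2)))

# Post-test = baseline + ordinary change + effect only for the treated group.
post = baseline + rng.normal(2, 5, size=n) + true_effect * assign

treated_mean = post[assign == 1].mean()
control_mean = post[assign == 0].mean()   # control group stands in for the counterfactual
print(f"Estimated impact (difference in means): {treated_mean - control_mean:.2f}")
```

The control group's mean outcome approximates what would have happened to participants without the intervention, which is exactly the role the "sent-home" half of the room would have played in the tongue-in-cheek workshop RCT above.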
7
Applying Impact Evaluation to this Workshop
What are the reasons why attempting to use a Randomized Control Trial to evaluate the impact of this workshop might not be a good idea? … 8

Workshop Objectives
1. The basics of the RealWorld Evaluation approach for addressing common issues and constraints faced by evaluators, such as: when the evaluator is not called in until the project is nearly completed and there was no baseline nor comparison group; or where the evaluation must be conducted with an inadequate budget and insufficient time; and where there are political pressures and expectations for how the evaluation should be conducted and what the conclusions should say; 9

Workshop Objectives
2. Defining what impact evaluation should be;
3. Identifying and assessing various design options that could be used in a particular evaluation setting;
4. Ways to reconstruct baseline data when the evaluation does not begin until the project is well advanced or completed;
5. How to account for what would have happened without the project's interventions: alternative counterfactuals;
6. The advantages of using mixed-methods designs. 10

Workshop Objectives
Note: In this workshop we focus on project-level impact evaluations. There are, of course, many other purposes, scopes, evaluands and types of evaluations. Some of these methods may apply to them, but our examples will be based on project impact evaluations, most of them in the context of developing countries. 11

Workshop agenda Day 1
1. Introduction [10 minutes]
2. Brief overview of the RealWorld Evaluation (RWE) approach [30 minutes]
3. Rapid review of evaluation designs, logic models, tools and techniques for RealWorld evaluations, with an emphasis on impact evaluation (IE); overview of the range of evaluation models for IE compared to other types of evaluations [75 minutes]
--- short break [20 minutes] ---
4. Small group introductions and sharing of the RWE types of constraints you have faced in your own practice, and how they were addressed [45 minutes]
5. What to do if there had been no baseline survey: reconstructing a baseline [30 minutes]
--- lunch [60 minutes] ---
6. Small group exercise Part I: read case studies and begin discussions [45 minutes]
7. Small group exercise Part II: 'Clients' and 'Consultants' re-negotiate the case study evaluation ToR [45 minutes]
8. Feedback from exercise [20 minutes]
9. Wrap-up discussion, evaluation of Day 1, plans for Day 2 [40 minutes]

Workshop agenda Day 2
1. Review and reflections on topics addressed in Day 1 [25 minutes]
2. Quantitative, qualitative and mixed methods [30 minutes]
3. How to account for what would have happened without the project: alternative counterfactuals [30 minutes]
--- short break [20 minutes] ---
4. Considerations for more holistic approaches to impact evaluation [45 minutes]
5. Checklists for assessing and addressing threats to validity [20 minutes]
6. Plenary discussion: practical realities of applying RWE approaches: challenges and strategies [30 minutes]
--- lunch [60 minutes] ---
7. Small-group work applying RealWorld Evaluation methodologies to the Norad Nepal evaluation case study [120 minutes]
8. Next steps: how to apply what we've learned in the real worlds of our work [20 minutes]
9. Workshop evaluation and wrap-up [20 minutes]

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
OVERVIEW OF THE RWE APPROACH 14

RealWorld Evaluation Scenarios
Scenario 1: Evaluator(s) not brought in until near the end of the project. For political, technical or budget reasons:
• There was no life-of-project evaluation plan
• There was no baseline survey
• Project implementers did not collect adequate data on project participants at the beginning or during the life of the project
• It is difficult to collect data on comparable control groups 15

RealWorld Evaluation Scenarios
Scenario 2: The evaluation team is called in early in the life of the project. But for budget, political or methodological reasons:
• The 'baseline' was a needs assessment, not comparable to the eventual evaluation
• It was not possible to collect baseline data on a comparison group 16

Reality Check – Real-World Challenges to Evaluation
• All too often, project designers do not think evaluatively – the evaluation is not designed until the end
• There was no baseline – at least not one with data comparable to the evaluation
• There was/can be no control/comparison group
• Limited time and resources for the evaluation
• Clients have prior expectations for what they want the evaluation findings to say
• Many stakeholders do not understand evaluation; they distrust the process, or even see it as a threat (dislike of being judged) 17

RealWorld Evaluation Quality Control Goals
Achieve the maximum possible evaluation rigor within the limitations of a given context
Identify and control for methodological weaknesses in the evaluation design
Negotiate with clients trade-offs between desired rigor and available resources
Presentation of findings must acknowledge methodological weaknesses and how they affect generalization to broader populations 18

The Need for the RealWorld Evaluation Approach
As a result of these kinds of constraints, many of the basic principles of rigorous impact evaluation design (comparable pre-test/post-test design, control group, adequate instrument development and testing, random sample selection, control for researcher bias, thorough documentation of the evaluation methodology, etc.) are often sacrificed. 19

The RealWorld Evaluation Approach
An integrated approach to ensure acceptable standards of methodological rigor while operating under real-world budget, time, data and political constraints. See the RealWorld Evaluation book, or at least the condensed summary, for more details. 20

[Book cover, 2nd Edition] This book addresses the challenges of conducting program evaluations in real-world contexts where evaluators and their clients face budget and time constraints and where critical data may be missing.
The book is organized around a seven-step model developed by the authors, which has been tested and refined in workshops and in practice. Vignettes and case studies—representing evaluations from a variety of geographic regions and sectors—demonstrate adaptive possibilities from small projects with budgets of a few thousand dollars to large-scale, long-term evaluations of complex programs. The text incorporates quantitative, qualitative, and mixed-method designs, and this Second Edition reflects important developments in the field over the last five years.
New to the Second Edition:
• Adds two new chapters on organizing and managing evaluations, including how to strengthen capacity and promote the institutionalization of evaluation systems
• Includes a new chapter on the evaluation of complex development interventions, with a number of promising new approaches presented
• Incorporates new material, including on ethical standards, debates over the "best" evaluation designs and how to assess their validity, and the importance of understanding settings
• Expands the discussion of program theory, incorporating theory of change, contextual and process analysis, multi-level logic models, using competing theories, and trajectory analysis
• Provides case studies of each of the 19 evaluation designs, showing how they have been applied in the field
"This book represents a significant achievement. The authors have succeeded in creating a book that can be used in a wide variety of locations and by a large community of evaluation practitioners." —Michael D. Niles, Missouri Western State University
"This book is exceptional and unique in the way that it combines foundational knowledge from social sciences with theory and methods that are specific to evaluation." —Gary Miron, Western Michigan University
"The book represents a very good and timely contribution worth having on an evaluator's shelf, especially if you work in the international development arena." —Thomaz Chianca, independent evaluation consultant, Rio de Janeiro, Brazil
RealWorld Evaluation: Working Under Budget, Time, Data, and Political Constraints, 2nd Edition — Michael Bamberger, Jim Rugh, Linda Mabry

The RealWorld Evaluation approach
Developed to help evaluation practitioners and clients: managers, funding agencies and external consultants
Still a work in progress (we continue to learn more through workshops like this)
Originally designed for developing countries, but equally applicable in industrialized nations 22

Special Evaluation Challenges in Developing Countries
Unavailability of needed secondary data
Scarce local evaluation resources
Limited budgets for evaluations
Institutional and political constraints
Lack of an evaluation culture (though evaluation associations are addressing this)
Many evaluations are designed by and for external funding agencies and seldom reflect local and national stakeholder priorities 23

Expectations for "rigorous" evaluations
Despite these challenges, there is a growing demand for methodologically sound evaluations which assess the impacts, sustainability and replicability of development projects and programs. (We'll be talking more about that later.) 24

Most RealWorld Evaluation tools are not new—but they promote a holistic, integrated approach
Most of the RealWorld Evaluation data collection and analysis tools will be familiar to experienced evaluators.
What we emphasize is an integrated approach which combines a wide range of tools, adapted to produce the best quality evaluation under RealWorld constraints. 25

What is Special About the RealWorld Evaluation Approach?
There is a series of steps, each with checklists for identifying constraints and determining how to address them. These steps are summarized on the following slide and then in the more detailed flow-chart … 26

The Steps of the RealWorld Evaluation Approach
Step 1: Planning and scoping the evaluation
Step 2: Addressing budget constraints
Step 3: Addressing time constraints
Step 4: Addressing data constraints
Step 5: Addressing political constraints
Step 6: Assessing and addressing the strengths and weaknesses of the evaluation design
Step 7: Helping clients use the evaluation 27

The RealWorld Evaluation Approach (detailed flow-chart)
Step 1: Planning and scoping the evaluation — A. Defining client information needs and understanding the political context; B. Defining the program theory model; C. Identifying time, budget, data and political constraints to be addressed by the RWE; D. Selecting the design that best addresses client needs within the RWE constraints.
Step 2: Addressing budget constraints — A. Modify evaluation design; B. Rationalize data needs; C. Look for reliable secondary data; D. Revise sample design; E. Economical data collection methods.
Step 3: Addressing time constraints — All Step 2 tools plus: F. Commissioning preparatory studies; G. Hire more resource persons; H. Revising the format of project records to include critical data for impact analysis; I. Modern data collection and analysis technology.
Step 4: Addressing data constraints — A. Reconstructing baseline data; B. Recreating comparison groups; C. Working with non-equivalent comparison groups; D. Collecting data on sensitive topics or from difficult-to-reach groups; E. Multiple methods.
Step 5: Addressing political influences — A. Accommodating pressures from funding agencies or clients on the evaluation design; B. Addressing stakeholder methodological preferences; C. Recognizing the influence of professional research paradigms.
Step 6: Assessing and addressing the strengths and weaknesses of the evaluation design — An integrated checklist for multi-method designs: A. Objectivity/confirmability; B. Replicability/dependability; C. Internal validity/credibility/authenticity; D. External validity/transferability/fittingness.
Step 7: Helping clients use the evaluation — A. Utilization; B. Application; C. Orientation; D. Action. 28

We will not have time in this workshop to cover all those steps. We will focus on:
Scoping the evaluation
Evaluation designs
Logic models
Reconstructing baselines
Mixed methods
Alternative counterfactuals
Realistic, holistic impact evaluation
Challenges and strategies 29

Planning and Scoping the Evaluation
Understanding client information needs
Defining the program theory model
Preliminary identification of constraints to be addressed by the RealWorld Evaluation 30

Understanding client information needs
Typical questions clients want answered:
Is the project achieving its objectives?
Is it having the desired impact?
Are all sectors of the target population benefiting?
Will the results be sustainable?
Which contextual factors determine the degree of success or failure? 31

Understanding client information needs
A full understanding of client information needs can often reduce the types of information collected and the level of detail and rigor necessary. However, this understanding could also increase the amount of information required!
32
Still part of scoping: other questions to answer as you customize an evaluation Terms of Reference (ToR):
1. Who asked for the evaluation? (Who are the key stakeholders?)
2. What are the key questions to be answered?
3. Will this be mostly a developmental, formative or summative evaluation?
4. Will there be a next phase, or other projects designed based on the findings of this evaluation? 33

Other questions to answer as you customize an evaluation ToR:
5. What decisions will be made in response to the findings of this evaluation?
6. What is the appropriate level of rigor?
7. What is the scope / scale of the evaluation / evaluand (thing to be evaluated)?
8. How much time will be needed / available?
9. What financial resources are needed / available? 34

Other questions to answer as you customize an evaluation ToR:
10. Should the evaluation rely mainly on quantitative or qualitative methods?
11. Should participatory methods be used?
12. Can / should there be a household survey?
13. Who should be interviewed?
14. Who should be involved in planning / implementing the evaluation?
15. What are the most appropriate forms of communicating the findings to different stakeholder audiences? 35

Evaluation (research) design? Key questions? Evaluand (what to evaluate)? Qualitative? Quantitative? Scope? Appropriate level of rigor? Resources available? Time available? Skills available? Participatory? Extractive? Evaluation FOR whom?
Does this help, or just confuse things more? Who said evaluations (like life) would be easy?!! 36

Before we return to the RealWorld steps, let's gain a perspective on levels of rigor, and what a life-of-project evaluation plan could look like. 37

Different levels of rigor depend on the source of evidence, the level of confidence, and the use of the information.
Objective, high precision – but requiring more time & expense:
Level 5: A very thorough research project is undertaken to conduct in-depth analysis of the situation; P = +/- 1%; book published!
Level 4: Good sampling and data collection methods are used to gather data that are representative of the target population; P = +/- 5%; decision maker reads the full report
Level 3: A rapid survey is conducted on a convenient sample of participants; P = +/- 10%; decision maker reads a 10-page summary of the report
Level 2: A fairly good mix of people are asked their perspectives about the project; P = +/- 25%; decision maker reads at least the executive summary of the report
Level 1: A few people are asked their perspectives about the project; P = +/- 40%; decision made in a few minutes
Level 0: Decision-maker's impressions based on anecdotes and sound bites heard during brief encounters (hallway gossip), mostly intuition; level of confidence +/- 50%; decision made in a few seconds
Quick & cheap – but subjective, sloppy 38

CONDUCTING AN EVALUATION IS LIKE LAYING A PIPELINE: THE QUALITY OF INFORMATION GENERATED BY AN EVALUATION DEPENDS UPON THE LEVEL OF RIGOR OF ALL COMPONENTS. THE AMOUNT OF "FLOW" (QUALITY) OF INFORMATION IS LIMITED TO THE SMALLEST COMPONENT OF THE SURVEY "PIPELINE".

[Chart] Determining appropriate levels of precision for events in a life-of-project evaluation plan: plotted against time during the project life cycle, the needs assessment and annual self-evaluations sit at lower rigor (level 2), a special study and the mid-term evaluation at level 3, and the baseline study and final evaluation at the same, higher level of rigor (level 4) so that they are comparable. 41

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
EVALUATION DESIGNS 42

So what should be included in a "rigorous impact evaluation"?
1. A direct cause-effect relationship between one output (or a very limited number of outputs) and an outcome that can be measured by the end of the research project? Pretty clear attribution. … OR …
2. Changes in higher-level indicators of sustainable improvement in the quality of life of people, e.g. the MDGs (Millennium Development Goals)? More significant, but direct attribution is much more difficult to assess. 43

So what should be included in a "rigorous impact evaluation"?
OECD-DAC (2002: 24) defines impact as "the positive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended. These effects can be economic, sociocultural, institutional, environmental, technological or of other types".
Does it mention or imply direct attribution? Or point to the need for counterfactuals or Randomized Control Trials (RCTs)? 44

Some of the purposes for program evaluation
Formative: learning and improvement, including early identification of possible problems
Knowledge generation: identify cause-effect correlations and generic principles about effectiveness
Accountability: to demonstrate that resources are used efficiently to attain desired results
Summative judgment: to determine the value and future of a program
Developmental evaluation: adaptation in complex, emergent and dynamic conditions
-- Michael Quinn Patton, Utilization-Focused Evaluation, 4th edition, pages 139-140 45

Determining an appropriate (and feasible) evaluation design
Based on the main purpose for conducting an evaluation, an understanding of client information needs, the required level of rigor, and what is possible given the constraints, the evaluator and client need to determine what evaluation design is required and possible under the circumstances. 46

Some of the considerations pertaining to evaluation design
1. When evaluation events take place (baseline, midterm, endline)
2. Review of different evaluation designs (experimental, quasi-experimental, other)
3. Levels of rigor
4. Qualitative & quantitative methods
5. A life-of-project evaluation design perspective 47

An introduction to various evaluation designs
[Graph] Illustrating the need for a quasi-experimental longitudinal time series evaluation design: the scale of a major impact indicator is traced over time for project participants and a comparison group, from baseline through the end-of-project evaluation to a post-project evaluation. 48

OK, let's stop the action to identify each of the major types of evaluation (research) design … one at a time, beginning with the most rigorous design. 49

First of all, the key to the traditional symbols:
X = Intervention (treatment), i.e. what the project does in a community
O = Observation event (e.g. baseline, mid-term evaluation, end-of-project evaluation)
P (top row) = Project participants
C (bottom row) = Comparison (control) group
Note: the 7 RWE evaluation designs are laid out on page 8 of the Condensed Overview of the RealWorld Evaluation book 50

Design #1: Longitudinal Quasi-experimental
Project participants: P1 X P2 X P3 P4
Comparison group: C1 C2 C3 C4
(baseline, midterm, end-of-project evaluation, post-project evaluation) 51

Design #2: Quasi-experimental (pre+post, with comparison)
Project participants: P1 X P2
Comparison group: C1 C2
(baseline, end-of-project evaluation) 52

Design #2+: Randomized Control Trial
Project participants: P1 X P2
Control group: C1 C2
Research subjects randomly assigned either to the project or the control group.
(baseline, end-of-project evaluation) 53

Design #3: Truncated Longitudinal
Project participants: X P1 X P2
Comparison group: C1 C2
(midterm, end-of-project evaluation) 54

Design #4: Pre+post of project; post-only comparison
Project participants: P1 X P2
Comparison group: C
(baseline, end-of-project evaluation) 55

Design #5: Post-test only of project and comparison
Project participants: X P
Comparison group: C
(end-of-project evaluation) 56

Design #6: Pre+post of project; no comparison
Project participants: P1 X P2
(baseline, end-of-project evaluation) 57

Design #7: Post-test only of project participants
Project participants: X P
(end-of-project evaluation) 58

See Table 2.2 on page 8 of the Condensed Overview of RWE, which summarizes the seven designs (T1 = baseline, T2 = midterm, T3 = endline, T4 = ex-post; X = intervention):
Design | T1 (baseline) | X | T2 (midterm) | X (cont.) | T3 (endline) | T4 (ex-post)
1 | P1, C1 | X | P2, C2 | X | P3, C3 | P4, C4
2 | P1, C1 | X | – | – | P2, C2 | –
3 | – | X | P1, C1 | X | P2, C2 | –
4 | P1 | X | – | – | P2, C | –
5 | – | X | – | – | P, C | –
6 | P1 | X | – | – | P2 | –
7 | – | X | – | – | P | – 59

"Non-Experimental" Designs [NEDs]
NEDs are impact evaluation designs that do not include a matched comparison group. Outcomes and impacts are assessed without a conventional statistical counterfactual to address the question: "What would have been the situation of the target population if the project had not taken place?" 60

Situations in which an NED may be the best design option
Not possible to define a comparison group
When the project involves complex processes of behavioral change
Complex, evolving contexts
Outcomes not known in advance
Many outcomes are qualitative
Projects operate in different local settings
When it is important to study implementation
Project evolves slowly over a long period of time 61

Some potentially strong NEDs
A. Interrupted time series
B. Single case evaluation designs
C. Longitudinal designs
D. Mixed method case study designs
E. Analysis of causality through program theory models
F. Concept mapping
G. Contribution Analysis 62

Any questions? 63

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
LOGIC MODELS 64

Defining the program theory model
All programs are based on a set of assumptions (hypotheses) about how the project's interventions should lead to desired outcomes. Sometimes this is clearly spelled out in project documents. Sometimes it is only implicit, and the evaluator needs to help stakeholders articulate the hypothesis through a logic model. 65

Defining the program theory model
Defining and testing critical assumptions are essential (but often ignored) elements of program theory models. The following is an example of a model to assess the impacts of microcredit on women's social and economic empowerment. 66

Critical logic chain hypothesis for a Gender-Inclusive Micro-Credit Program
Sustainability: Structural changes will lead to long-term impacts.
Medium/long-term impacts: Increased women's economic and social empowerment. Economic and social welfare of women and their families will improve.
Short-term outcomes: If women obtain loans they will start income-generating activities. Women will be able to control the use of loans and reimburse them.
Outputs: If credit is available women will be willing and able to obtain loans and technical assistance.
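One common way to analyze Design #2 above (pre+post with a comparison group) is a difference-in-differences comparison of the four observations P1, P2, C1 and C2. The sketch below is illustrative only: it is not from the RWE book, and the group means are invented.

```python
# Illustrative sketch only: difference-in-differences for Design #2
# (P1 X P2 over C1 C2). All values are invented group means of an outcome
# indicator; a real analysis would also need standard errors and checks of
# whether the non-equivalent comparison group would have changed similarly.
P1, P2 = 52.0, 68.0   # project participants: baseline and end-of-project means
C1, C2 = 50.0, 58.0   # comparison group: baseline and end-of-project means

project_change = P2 - P1          # change among participants
comparison_change = C2 - C1       # change that happened anyway (counterfactual trend)
impact_estimate = project_change - comparison_change

print(f"Project change:    {project_change:+.1f}")
print(f"Comparison change: {comparison_change:+.1f}")
print(f"Difference-in-differences impact estimate: {impact_estimate:+.1f}")
```

The comparison group's change stands in for the counterfactual trend; as the later slides on non-equivalent comparison groups note, this only holds if the two groups would have changed similarly without the intervention.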
67
Example of a threat to internal validity
The assumed causal model: Women join the village bank, where they receive loans, learn skills and gain self-confidence, WHICH increases women's income, WHICH increases women's control over household resources.
An alternative causal model: Some women had previously taken literacy training, which increased their self-confidence and work skills. Women who had taken literacy training are more likely to join the village bank, and their literacy and self-confidence make them more effective entrepreneurs. So women's income and control over household resources increased as a combined result of literacy, self-confidence and loans.

[Diagram] A generic problem tree and the corresponding objectives tree: a central PROBLEM with primary, secondary and tertiary causes below it and consequences above it, mirrored by a DESIRED IMPACT with outcomes, outputs and interventions below it.

[Diagram] Results chain for a girls' education program: interventions (schools built, curriculum improved, the school system hiring and paying teachers) lead to parents being persuaded to send girls to school and female enrollment rates increasing; together with improved educational policies and economic opportunities for women, these lead to young women being educated, women empowered, women in leadership roles, and ultimately a reduction in poverty. To have synergy and achieve impact, all of these need to address the same target population.

Program Goal (at impact level): Young women educated. Contributing project goals: Advocacy Project Goal – improved educational policies enacted; Construction Project Goal – more classrooms built; Teacher Education Project Goal – improved quality of curriculum. One of these is OUR project, one will be done by a PARTNER, and one is an ASSUMPTION (that others will do this).

What does it take to measure indicators at each level?
Impact: Population-based survey (baseline, endline evaluation)
Outcome: Change in behavior of participants (can be surveyed annually)
Output: Measured and reported by project staff (annually)
Activities: On-going (monitoring of interventions)
Inputs: On-going (financial accounts)

We need to recognize which evaluative process is most appropriate for measurement at the various levels (impact, outcomes, outputs, activities, inputs): IMPACT EVALUATION at the higher levels, PROJECT EVALUATION in the middle, and PERFORMANCE MONITORING for activities and inputs.

One form of Program Theory (Logic) Model: Design → Inputs → Implementation Process → Outputs → Outcomes → Impacts → Sustainability, set within the economic context in which the project operates, the institutional and operational context, the political context, and the socio-economic and cultural characteristics of the affected populations.
Note: The orange boxes are included in conventional Program Theory Models. The addition of the blue boxes provides the recommended, more complete analysis.
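As a small illustration of the program theory models just described, the sketch below (not from the RWE book; every entry is an invented example for a micro-credit program) represents a logic model as data so an evaluation team can check that each level of the results chain has an agreed indicator and data source, alongside the contextual factors.

```python
# Illustrative sketch only (not from the RWE book): a program theory (logic) model
# held as data, so gaps in indicators or data sources can be spotted before the
# evaluation is designed. All entries are invented examples for a micro-credit program.
from dataclasses import dataclass, field

@dataclass
class Level:
    name: str                                      # e.g. "Outputs", "Outcomes", "Impacts"
    indicators: list = field(default_factory=list) # what will be measured
    data_source: str = ""                          # who measures it, and how often

logic_model = [
    Level("Inputs", ["funds disbursed"], "financial accounts (on-going)"),
    Level("Activities", ["loans issued"], "monitoring of interventions (on-going)"),
    Level("Outputs", ["women obtaining loans"], "project staff reports (annual)"),
    Level("Outcomes", ["income-generating activities started"], "participant survey (annual)"),
    Level("Impacts", ["women's economic and social empowerment"], "population-based survey (baseline, endline)"),
]

context_factors = ["economic context", "political context",
                   "institutional and operational context",
                   "socio-economic and cultural characteristics of the population"]

# Flag any level of the results chain that has no agreed indicator or data source yet.
for level in logic_model:
    if not level.indicators or not level.data_source:
        print(f"Gap in the logic model: {level.name} has no agreed indicator/data source")
```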
76 77
[Diagram, source: OECD/DAC Network on Development Evaluation] Education intervention logic: output clusters (institutional management, education facilities, curricula & teaching materials, teacher recruitment & training) lead to specific impact outcomes (better allocation of educational resources, increased affordability of education, quality of education, skills and learning enhancement, equitable access to education – MDG 2), then to intermediate impacts (improved participation in society, improved family planning & health awareness, greater income opportunities, optimal employment) and to global impacts (economic growth, poverty reduction – MDG 1, social development, health – MDG 2 and MDG 3).

Expanding the results chain for a multi-donor, multi-component program (Inputs → Outputs → Intermediate outcomes → Impacts):
• A donor provides credit for small farmers, leading to increased production and increased rural household (H/H) income
• Government builds rural roads, giving access to off-farm employment and increased political participation
• Other donors support schools and health services, leading to increased school enrolment and increased use of health services, and in turn improved education performance and improved health
Attribution gets very difficult! Consider the plausible contributions each makes.

TIME FOR SMALL GROUP DISCUSSION 80
1. Self-introductions
2. What constraints of these types have you faced in your evaluation practice?
3. How did you cope with them? 81

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
Where there was no baseline

Ways to reconstruct baseline conditions
A. Secondary data
B. Project records
C. Recall
D. Key informants
E. Participatory methods 83

Ways to reconstruct baseline conditions
E. PRA (Participatory Rapid Appraisal) and PLA (Participatory Learning and Action) and other participatory techniques, such as timelines and critical incidents, help establish the chronology of important changes in the community 84

Assessing the utility of potential secondary data
Reference period
Population coverage
Inclusion of required indicators
Completeness
Accuracy
Free from bias 85

Examples of secondary data to reconstruct baselines
Census
Other surveys by government agencies
Special studies by NGOs, donors
University research studies
Mass media (newspapers, radio, TV)
External trend data that might have been monitored by the implementing agency 86

Using internal project records
Types of data: feasibility/planning studies, application/registration forms, supervision reports, Management Information System (MIS) data, meeting reports, community and agency meeting minutes, progress reports, and construction, training and other implementation records, including costs 87

Assessing the reliability of project records
Who collected the data and for what purpose? Were they collected for record-keeping or to influence policymakers or other groups?
Do monitoring data only refer to project activities, or do they also cover changes in outcomes?
Were the data intended exclusively for internal use? For use by a restricted group? Or for public use?
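A minimal sketch of how the internal project records and secondary data just listed might be combined to reconstruct and sanity-check a baseline value. It is illustrative only: the file names, column names and the 20% tolerance are all invented for the example.

```python
# Illustrative sketch only: reconstructing a baseline indicator (e.g. household
# income at enrolment) from project application/registration records, then
# cross-checking (triangulating) it against a secondary source covering the same
# reference period. File names, column names and the threshold are invented.
import csv
from statistics import mean

def load_column(path, column):
    """Read one numeric column from a CSV file of records."""
    with open(path, newline="") as f:
        return [float(row[column]) for row in csv.DictReader(f) if row[column]]

# Baseline reconstructed from registration forms filled in when households joined.
recalled_baseline = load_column("registration_forms_2008.csv", "monthly_income")

# Secondary data (e.g. a government household survey) for the same area and year.
secondary = load_column("district_household_survey_2008.csv", "monthly_income")

reconstructed = mean(recalled_baseline)
benchmark = mean(secondary)
gap = abs(reconstructed - benchmark) / benchmark

print(f"Reconstructed baseline mean income: {reconstructed:.0f}")
print(f"Secondary-source mean income:       {benchmark:.0f}")
if gap > 0.20:   # arbitrary 20% tolerance, just for illustration
    print("Large discrepancy: check reference period, population coverage and possible bias")
```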
88
Typical kinds of information for which we try to reconstruct baseline data
School attendance and time/cost of travel
Sickness/use of health facilities
Income and expenditures
Community/individual knowledge and skills
Social cohesion/conflict
Water usage/quality/cost
Periods of stress
Travel patterns 89

Limitations of recall
Generally not reliable for precise quantitative data
Sample selection bias
Deliberate or unintentional distortion
Few empirical studies (except on expenditure) to help adjust estimates 90

Sources of bias in recall
Who provides the information
Under-estimation of small and routine expenditures
"Telescoping" of recall concerning major expenditures
Distortion to conform to accepted behavior:
• Intentional or unconscious
• Romanticizing the past
• Exaggerating (e.g. "We had nothing before this project came!")
Contextual factors:
• Time intervals used in the question
• Respondents' expectations of what the interviewer wants to know
Implications for the interview protocol 91

Improving the validity of recall
Conduct small studies to compare recall with survey or other findings
Ensure all relevant groups are interviewed
Triangulation
Link recall to important reference events:
• Elections
• Drought/flood/tsunami/war/displacement
• Construction of a road, school, etc. 92

Key informants
Not just officials and high-status people. Everyone can be a key informant on their own situation:
• Single mothers
• Factory workers
• Users of public transport
• Sex workers
• Street children 93

Guidelines for key-informant analysis
Triangulation greatly enhances validity and understanding
Include informants with different experiences and perspectives
Understand how each informant fits into the picture
Employ multiple rounds if necessary
Carefully manage ethical issues 94

PRA and related participatory techniques
PRA (Participatory Rapid Appraisal) and PLA (Participatory Learning and Action) techniques collect data at the group or community [rather than individual] level
Can either seek to identify consensus or to identify different perspectives
Risk of bias:
• If only certain sectors of the community participate
• If certain people dominate the discussion 95

Summary of issues in baseline reconstruction
Variations in the reliability of recall
Memory distortion
Secondary data not easy to use
Secondary data incomplete or unreliable
Key informants may distort the past 96

Enough of our presentations: it's time for you (THE RealWorld PEOPLE!) to get involved yourselves. Time for small-group work. Read your case studies and begin your discussions. 99

Small group case study work
1. Some of you are playing the role of evaluation consultants; others are clients commissioning the evaluation.
2. Decide what your group will propose to do to address the given constraints/challenges.
3. Prepare to negotiate the ToR with the other group (later this afternoon).
The purpose of this exercise is to gain some practical 'feel' for applying what we've learned about RealWorld Evaluation.
Group A (consultants): the evaluation team is to consider how they will propose a revised evaluation design and plan that reduces the budget by 25% to 50%, and yet meets the needs of both clients (the City Housing Department and the international donor).
Group B (clients) will also review the initial proposal in light of what they have learned about RealWorld Evaluation, and prepare to re-negotiate the plans with the consultancy group.
Note: there are two types of clients: the Housing Department (project implementers) and the international donor (foundation).
Groups are given 45 minutes to prepare their cases. Later, the Consultants' group will meet with the Clients' groups to negotiate their proposed revisions to the plans for this evaluation. 45 minutes will be available for those negotiation sessions.
Time for consultancy teams to meet with clients to negotiate the revised ToRs for the evaluation of the housing project. 103

What did you learn about RealWorld Evaluation by getting involved in that case study and role-playing exercise? 104

Wrap-up of Day One
1. How are you feeling about this workshop at this point? Tired? 105

Wrap-up of Day One
1. How are you feeling about this workshop at this point? Energized?!! 106

Wrap-up of Day One
2. What would you want us to do differently tomorrow?
3. Please do read the Nepal case study, as we will be using it as our practical exercise tomorrow. 107

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
Reflections on Day One

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
Mixed-method evaluations

Quantitative data collection methods
Structured surveys (household, farm, transport usage, etc.)
Structured observation
Anthropometric methods
Aptitude and behavioral tests
Indicators that can be counted 110

Quantitative data collection methods: strengths and weaknesses
Strengths: generalization – statistically representative; estimate the magnitude and distribution of impacts; clear documentation of methods; standardized approach; statistical control of bias and external factors
Weaknesses: surveys cannot capture many types of information; do not work for difficult-to-reach groups; no analysis of context; the survey situation may alienate; long delay in obtaining results; data reduction loses information 111

Qualitative data collection methods
Interviewing: structured, semi-structured and unstructured interviews; focus groups; community interviews; PRA; audio recording
Observation: participant observation; structured observation; unstructured observation; photography and video recording
Analysis of documents and artifacts: project documents; published reports; e-mail; legal documents (birth and death certificates, property transfer documents, marriage certificates); posters; decorations in the house; clothing and gang insignia 112

Qualitative data collection methods: characteristics
The researcher's perspective is an integral part of what is recorded about the social world
Scientific detachment is not possible
Meanings given to social phenomena and situations must be understood
Programs cannot be studied independently of their context
Difficult to define clear cause and effect
Change must be studied holistically 113

Using qualitative methods to improve the evaluation design and results
Use recall to reconstruct the pre-test situation
Interview key informants to identify other changes in the community or in gender relations
Conduct interviews or focus groups with women and men to:
• assess the effect of loans on gender relations within the household, such as changes in control of resources and decision-making
• identify other important results or unintended consequences: an increase in women's work load, an increase in the incidence of gender-based or domestic violence 114

It should NOT be a fight between pure QUALITATIVE methods (verbiage alone – the "Qualoid") and pure QUANTITATIVE methods (numbers alone – the "Quantoid")!
115
"Your numbers look impressive, but let me tell you the human interest story."
"Your human interest story sounds nice, but let me show you the statistics." 116

What's needed is the right combination of BOTH QUALITATIVE methods AND QUANTITATIVE methods. 117

Mixed method evaluation designs
Combine the strengths of both QUANT and QUAL approaches
One approach (QUANT or QUAL) is often dominant and the other complements it
Both approaches can be equal, but that is harder to design and manage
Can be used sequentially or concurrently 118

Mixed method evaluation designs: how quantitative and qualitative methods complement each other
A. Broaden the conceptual framework
• Combining theories from different disciplines
• Exploratory QUAL studies can help define the framework
B. Combine generalizability with depth and context
• Random [QUANT] subject selection ensures representativeness and generalizability
• [QUAL] Case studies, focus groups, etc. can help understand the characteristics of the different groups selected in the sample
C. Permit access to difficult-to-reach groups [QUAL]
• PRA, focus groups, case studies, etc. can be effective ways to reach women, ethnic minorities and other vulnerable groups
• Direct observation can provide information on groups that are difficult to interview, for example the informal sector and illegal economic activities
D. Enable process analysis [QUAL]
• Observation, focus groups and informal conversations are more effective for understanding group processes or interaction between people and public agencies, and for studying the organization 119

Mixed method evaluation designs: how quantitative and qualitative methods complement each other (cont.)
E. Analysis and control for underlying structural factors [QUANT]
• Sampling and statistical analysis can avoid misleading conclusions
• Propensity scores and multivariate analysis can statistically control for differences between project and control groups
Example:
• Meetings with women may suggest gender biases in local firms' hiring practices; however,
• Using statistical analysis to control for years of education or experience may show there are no differences in hiring policies for workers with comparable qualifications
Example:
• Participants who volunteer to attend a focus group may be strongly in favor of or opposed to a certain project, but
• A rapid sample survey may show that most community residents have different views 120

Mixed method evaluation designs: how quantitative and qualitative methods complement each other (cont.)
F. Triangulation and consistency checks
• Direct observation may identify inconsistencies in interview responses
Examples:
• A family may say they are poor, but observation shows they have new furniture, good clothes, etc.
• A woman may say she has no source of income, but an early morning visit may show she operates an illegal beer brewing business
G. Broadening the interpretation of findings
• Combining personal experience with "social facts"
• Statistical analysis frequently includes unexpected or interesting findings which cannot be explained through the statistics; rapid follow-up visits may help explain the findings 121

Mixed method evaluation designs: how quantitative and qualitative methods complement each other (cont.)
H.
Interpreting findings
Example:
• A QUANT survey of community water management in Indonesia found that, with only one exception, all village water supplies were managed by women
• Follow-up [QUAL] visits found that in the one exceptional village women managed a very profitable dairy farming business – so men were willing to manage water to allow women time to produce and sell dairy produce
Source: Brown (2000) 122

Determining appropriate precision and mix of multiple methods
[Diagram] The same methods – nutritional measurements, household (HH) surveys, focus groups, key informant interviews, large-group discussions – are arrayed along two axes: from participatory/qualitative to extractive/quantitative, and from low rigor (questionable quality, quick and cheap) to high rigor (high quality, more time & expense).
Participatory approaches should be used as much as possible, but even they should be used with appropriate rigor: how many (and which) people's perspectives contributed to the story? 124

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
Determining Counterfactuals

Attribution and counterfactuals
How do we know if the observed changes in the project participants or communities
• income, health, attitudes, school attendance, etc.
are due to the implementation of the project
• credit, water supply, transport vouchers, school construction, etc.
or to other unrelated factors?
• changes in the economy, demographic movements, other development programs, etc. 126

The Counterfactual
What change would have occurred in the relevant condition of the target population if there had been no intervention by this project? 127

Where is the counterfactual?
After families had been living in a new housing project for 3 years, a study found that average household income had increased by 50%. Does this show that housing is an effective way to raise income? 128

[Graph] Comparing the project with two possible comparison groups (household income, 2004–2009): the project group's income increases by 50% (from 500 to 750). Scenario 1: no increase in the comparison group's income – potential evidence of project impact. Scenario 2: a 50% increase in the comparison group's income as well – no evidence of project impact.

Control group and comparison group
Control group = randomized allocation of subjects to project and non-treatment groups
Comparison group = a separate procedure for sampling project and non-treatment groups that are as similar as possible in all aspects except the treatment (intervention) 130

IE Designs: Experimental Designs – Randomized Control Trials
Eligible individuals, communities, schools, etc. are randomly assigned to either:
• The project group (that receives the services) or
• The control group (that does not have access to the project services) 131

[Graph] A graphical illustration of an 'ideal' counterfactual: the primary outcome follows its pre-project trend line; subjects are then randomly assigned either to the treatment group or the control group when the intervention begins, and the impact is the gap that opens between the two groups over time. 132

There are other methods for assessing the counterfactual
Reliable secondary data that depict relevant trends in the population
Longitudinal monitoring data (if it includes the non-reached population)
Qualitative methods to obtain the perspectives of key informants, participants, neighbors, etc.
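Building on the slide above about other methods for assessing the counterfactual, here is a minimal illustrative sketch of using reliable secondary trend data as a rough counterfactual: fit the pre-project trend and project it forward, then compare with what was observed among participants. It is not from the workshop materials, and all figures are invented.

```python
# Illustrative sketch only: using secondary trend data as a rough counterfactual
# when there is no comparison group. All figures below are invented.
import numpy as np

# Average household income in the district (secondary data), before the project.
pre_years = np.array([2004, 2005, 2006, 2007])
pre_income = np.array([500.0, 515.0, 528.0, 545.0])

# Fit a simple linear trend to the pre-project years.
slope, intercept = np.polyfit(pre_years, pre_income, deg=1)

# Observed average income among project participants after the project.
post_years = np.array([2008, 2009])
observed = np.array([600.0, 660.0])

# Counterfactual: what the pre-project trend alone would have predicted.
expected = slope * post_years + intercept
impact_estimate = observed - expected

for year, obs, exp, imp in zip(post_years, observed, expected, impact_estimate):
    print(f"{year}: observed {obs:.0f}, trend-based counterfactual {exp:.0f}, "
          f"estimated impact {imp:+.0f}")
# Caveat (as the slides stress): trends, contamination and initial differences mean
# this is plausible-contribution evidence, not proof of attribution.
```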
133
Ways to reconstruct comparison groups
Judgmental matching of communities
When there is a phased introduction of project services, beneficiaries entering in later phases can be used as "pipeline" comparison groups
Internal controls, when different subjects receive different combinations and levels of services 134

Using propensity scores and other methods to strengthen comparison groups
Propensity score matching
Rapid assessment studies can compare characteristics of project and comparison groups using:
• Observation
• Key informants
• Focus groups
• Secondary data
• Aerial photos and GIS data 135

Issues in reconstructing comparison groups
Project areas are often selected purposively and difficult to match
Differences between project and comparison groups – difficult to assess whether outcomes were due to project interventions or to these initial differences
Lack of good data to select comparison groups
Contamination (good ideas tend to spread!)
Econometric methods cannot fully adjust for initial differences between the groups [unobservables] 136

What experiences have you had with identifying counterfactual data? 137

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
More holistic approaches to Impact Evaluation 139

Let's talk about the challenges of conducting impact evaluations in the real world. 140

Some recent developments in impact evaluation in development (2003–2012)
J-PAL: "best understood as a network of affiliated researchers … united by their use of the randomized trial methodology…"
Impact Evaluation for Improving Development – 3ie and AfrEA conference in Cairo, March 2009
Stern, E., N. Stame, J. Mayne, K. Forss, R. Davies and B. Befani, Broadening the Range of Designs and Methods for Impact Evaluations, DFID Working Paper 38 (2012) 141

So, are we saying that Randomized Control Trials (RCTs) are the Gold Standard and should be used in most if not all program impact evaluations? Yes or no? Why or why not? If so, under what circumstances should they be used? If not, under what circumstances would they not be appropriate? 142

Different lenses needed for different situations in the RealWorld
Simple – following a recipe: recipes are tested to assure easy replication; the best recipes give good results every time
Complicated – sending a rocket to the moon: sending one rocket to the moon increases assurance that the next will also be a success; there is a high degree of certainty of outcome
Complex – raising a child: raising one child provides experience but is no guarantee of success with the next; uncertainty of outcome remains
Sources: Westley et al (2006) and Stacey (2007), cited in Patton 2008; also presented by Patricia Rogers at the Cairo impact conference, 2009. 143

Evidence-based policy for simple interventions (or simple aspects): when RCTs may be appropriate
Question needed for evidence-based policy: What works?
What interventions look like: discrete, standardized intervention
Logic model: simple, direct cause-effect
How interventions work: pretty much the same everywhere
Process needed for evidence uptake: knowledge transfer
Adapted from Patricia Rogers, RMIT University 144

When might rigorous evaluations of higher-level "impact" indicators not be needed?
Complicated, complex programs where there are multiple interventions by multiple actors
Projects working in evolving contexts (e.g. conflicts, natural disasters)
Projects with multiple layered logic models, or unclear cause-effect relationships between outputs and higher-level "vision statements" (as is often the case in the RealWorld of international development projects) 145

When might rigorous evaluations of higher-level "impact" indicators not be needed?
An approach evaluators might take is that if the correlation between intermediary effects (outcomes) and higher-level impact has been adequately established through research and previous evaluations, then assessing intermediary outcome-level indicators might suffice, as long as the contexts (internal and external conditions) can be shown to be sufficiently similar to those where such cause-effect correlations have been tested. 146

Examples of cause-effect correlations that are generally accepted
• Vaccinating young children with a standard set of vaccinations at prescribed ages leads to a reduction of childhood diseases (the means of verification involves viewing children's health charts, not just the total quantity of vaccines delivered to the clinic)
• Other examples … ? 147

But look at examples of what kinds of interventions have been "rigorously tested" using RCTs
• Conditional cash transfers
• The use of visual aids in Kenyan schools
• Deworming children (as if that's all that's needed to enable them to get a good education)
• Note that this kind of research is based on the quest for Silver Bullets – simple, most cost-effective solutions to complex problems. 148

"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." J. W. Tukey (1962, page 13), "The future of data analysis", Annals of Mathematical Statistics 33(1), pp. 1-67. Quoted by Patricia Rogers, RMIT University 149

"An expert is someone who knows more and more about less and less until he knows absolutely everything about nothing at all."* *Quoted by a friend; also available at www.murphys-laws.com 150

Is that what we call "scientific method"? There is much more to impact, to rigor, and to "the scientific method" than RCTs. Serious impact evaluations require a more holistic approach. 151

[Diagram] A simple RCT typically tests a single intervention–output–outcome link, whereas a more comprehensive design covers the whole objectives tree: interventions leading to outputs, outcomes, the desired impact and their wider consequences.

There can be validity problems with RCTs
Internal validity – quality issues: poor measurement, poor adherence to randomisation, inadequate statistical power, ignored differential effects, inappropriate comparisons, fishing for statistical significance, differential attrition between control and treatment groups, treatment leakage, unplanned cross-over, unidentified poor quality implementation. Other issues: random error, contamination from other sources, the need for a complete causal package, lack of blinding.
External validity – effectiveness in real-world practice, transferability to new situations.
Patricia Rogers, RMIT University 153

The limited use of strong evaluation designs
In the RealWorld (at least of international development programs) we estimate that:
• fewer than 5%-10% of project impact evaluations use strong experimental or even quasi-experimental designs
• significantly fewer than 5% use randomized control trials ('pure' experimental design) 154

What kinds of evaluation designs are actually used in the real world of international development?
Findings from meta-evaluations of 336 evaluation reports of an INGO:
Post-test only: 59%
Before-and-after: 25%
With-and-without: 15%
Other counterfactual: 1%

Rigorous impact evaluation should include (but is not limited to):
1) thorough consultation with and involvement by a variety of stakeholders,
2) articulating a comprehensive logic model that includes relevant external influences,
3) getting agreement on desirable 'impact level' goals and indicators,
4) adapting the evaluation design as well as data collection and analysis methodologies to respond to the questions being asked, …

Rigorous impact evaluation should include (but is not limited to):
5) adequately monitoring and documenting the process throughout the life of the program being evaluated,
6) using an appropriate combination of methods to triangulate the evidence being collected,
7) being sufficiently flexible to account for evolving contexts, …

Rigorous impact evaluation should include (but is not limited to):
8) using a variety of ways to determine the counterfactual,
9) estimating the potential sustainability of whatever changes have been observed,
10) communicating the findings to different audiences in useful ways,
11) etc. …

The point is that the list of what's required for 'rigorous' impact evaluation goes way beyond initial randomization into treatment and 'control' groups. To attempt to conduct an impact evaluation of a program using only one pre-determined tool is to suffer from myopia, which is unfortunate. On the other hand, to prescribe to donors and senior managers of major agencies that there is a single preferred design and method for conducting all impact evaluations can have, and has had, unfortunate consequences for all of those who are involved in the design, implementation and evaluation of international development programs.

We must be careful that in using the "Gold Standard" we do not violate the "Golden Rule": "Judge not that you not be judged!" In other words: "Evaluate others as you would have them evaluate you."

Caution: Too often what is called Impact Evaluation is based on a "we will examine and judge you" paradigm. When we want our own programs evaluated we prefer a more holistic approach. How much more helpful it is when the approach to evaluation is more like holding up a mirror to help people reflect on their own reality: facilitated self-evaluation.

To use the language of the OECD/DAC, let's be sure our evaluations are consistent with these criteria:
RELEVANCE: The extent to which the aid activity is suited to the priorities and policies of the target group, recipient and donor.
EFFECTIVENESS: The extent to which an aid activity attains its objectives.
EFFICIENCY: Efficiency measures the outputs – qualitative and quantitative – in relation to the inputs.
IMPACT: The positive and negative changes produced by a development intervention, directly or indirectly, intended or unintended.
SUSTAINABILITY: Concerned with measuring whether the benefits of an activity are likely to continue after donor funding has been withdrawn. Projects need to be environmentally as well as financially sustainable.

The bottom line is defined by this question: Are our programs making plausible contributions towards positive impact on the quality of life of our intended beneficiaries? Let's not forget them!
ANY QUESTIONS? 166

RealWorld Evaluation: Designing Evaluations under Budget, Time, Data and Political Constraints
Threats to Validity Checklists
1. What is validity and why does it matter?
Defining validity
The degree to which the evaluation findings and recommendations are supported by:
The conceptual framework describing how the project is supposed to achieve its objectives
Statistical techniques (including sample design)
How the project and the evaluation were implemented
The similarities between the project population and the wider population to which findings are generalized 169

Why validity is important
Evaluations provide recommendations for future decisions and action. If the findings and interpretation are not valid:
Programs which do not work may continue or even be expanded
Good programs may be discontinued
Priority target groups may not have access or benefit 170

RWE quality control goals
The evaluator must achieve the greatest possible methodological rigor within the limitations of a given context
Standards must be appropriate for different types of evaluation
The evaluator must identify and control for methodological weaknesses in the evaluation design
The evaluation report must identify methodological weaknesses and how these affect generalization to broader populations 171

2. General guidelines for assessing the validity of all evaluation designs
A. Confirmability
B. Reliability
C. Credibility
D. Transferability
E. Utilization 172

A. Confirmability
Are the conclusions drawn from the available evidence, and is the research relatively free of researcher bias?
Examples:
A-1: Inadequate documentation of methods and procedures
A-2: Are data presented to support the conclusions, and are the conclusions consistent with the findings? [Compare the executive summary with the data in the main report] 173

B. Reliability
Is the process of the study consistent, reasonably stable over time and across researchers and methods?
Examples:
B-1: Data were only collected from people who attended focus groups or community meetings
B-2: Were coding and quality checks made, and did they show agreement? 174

C. Credibility
Are the findings credible to the people studied and to readers? Is there an authentic picture of what is being studied?
Examples:
C-1: Is there sufficient information to provide a credible description of the subjects or situations studied?
C-2: Was triangulation among methods and data sources systematically applied? Were findings generally consistent? What happened if they were not? 175

D. Transferability
Do the conclusions fit other contexts, and how widely can they be generalized?
Examples:
D-1: Are the characteristics of the sample described in enough detail to permit comparisons with other samples?
D-2: Does the report present enough detail for readers to assess potential transferability? 176

E. Utilization
Were the findings useful to clients, researchers and the communities studied?
Examples:
E-1: Were the findings intellectually and physically accessible to potential users?
E-2: Do the findings provide guidance for future action? 177

3. Additional threats to validity for Quasi-Experimental Designs [QED]
F. Threats to statistical conclusion validity: why inferences about statistical association between two variables (for example, project intervention and outcome) may not be valid
G. Threats to internal validity: why assumptions that project interventions have caused observed outcomes may not be valid
H. Threats to construct validity: why selected indicators may not adequately describe the constructs and causal linkages in the evaluation model
I.
Threats to external validity: why assumptions about the potential replicability of a project in other locations or with other groups may not be valid 178

Threats to Validity Checklists [See Appendices A-E of the RWE book, pp. 490-555]
• Appendix A: Worksheet for quantitative designs
• Appendix B: Worksheet for qualitative designs
• Appendix C: Worksheet for standard mixed-method designs
• Appendix D: Example of a completed Threats to Validity worksheet
• Appendix E: Worksheet for advanced mixed-method designs 185

1. Let's talk about what we're learning.
2. What are some of the practical realities of applying RWE approaches? 192

Nepal case study exercise
1. Imagine that you have been contracted as a consultant to the Norad Evaluation Department to serve in the role of expert advisor, to help guide the planning for and reporting of this evaluation.
2. As you read the ToR (Terms of Reference) [extracted from the 205-page full report], what suggestions would you make for improving the ToR (making it more realistic), based on what you've learned about RealWorld Evaluation designs and methodologies?
3. Subsequently, upon reading the summary descriptions of the evaluation report (Part II), what feedback would you provide to those who conducted the evaluation?
4. From what you have learned from this evaluation, what advice would you give to the Norad Evaluation Department (or your own agency) about designing future evaluations of this kind? 193

Next steps: How are we going to apply these things as we go forth and practice evaluation in our own Real Worlds?! 194

Workshop wrap-up and evaluation. 195

Main workshop messages
1. Evaluators must be prepared for RealWorld evaluation challenges.
2. There is considerable experience to learn from.
3. A toolkit of practical "RealWorld" evaluation techniques is available (see www.RealWorldEvaluation.org).
4. Never use time and budget constraints as an excuse for sloppy evaluation methodology.
5. A "threats to validity" checklist helps keep you honest by identifying potential weaknesses in your evaluation design and analysis. 196 197

Additional References for IE
DFID, "Broadening the range of designs and methods for impact evaluations": http://www.oecd.org/dataoecd/0/16/50399683.pdf
Robert Picciotto, "Experimentalism and development evaluation: Will the bubble burst?", Evaluation journal (EES), April 2012: http://evi.sagepub.com/
Martin Ravallion, "Should the Randomistas Rule?", Economists' Voice, 2009: http://ideas.repec.org/a/bpj/evoice/v6y2009i2n6.html
William Easterly, "Measuring How and Why Aid Works—or Doesn't": http://aidwatchers.com/2011/05/controlled-experiments-and-uncontrollablehumans/
"Control freaks: Are 'randomised evaluations' a better way of doing aid and development policy?", The Economist, June 12, 2008: http://www.economist.com/node/11535592
Series of guidance notes (and webinars) on impact evaluation produced by InterAction (a consortium of US-based INGOs): http://www.interaction.org/impact-evaluation-notes
Evaluation (EES journal), Special Issue on Contribution Analysis, Vol. 18, No. 3, July 2012: www.evi.sagepub.com
Impact Evaluation in Practice: www.worldbank.org/ieinpractice 198