DRAFT 9/25/08

Recommendation Process
Advisory Committee on Heritable Disorders and Genetic Diseases of Newborns and Children

Purpose

The Advisory Committee on Heritable Disorders and Genetic Diseases of Newborns and Children (the Advisory Committee) has, as one of its charges, the responsibility of making evidence-based recommendations regarding what additional tests should be added to the current core set of newborn screening (NBS) tests, as well as evaluating and updating the current core set. To support the steps required to meet this responsibility, the Advisory Committee has approved a process for nominating conditions and a process for the systematic review of the evidence (SER) regarding screening for these conditions. The purpose of this document is to outline the process for creating a recommendation regarding adding tests to the core set based on the SER.

Indirect chain of evidence

For evaluation of the heritable conditions nominated for recommended inclusion in the uniform NBS panel of disorders, it is unlikely that the Advisory Committee will have the kind of evidence on newborn screening or treatment for rare disorders that formal evidence reviews traditionally consider most reliable. Such evidence usually includes peer-reviewed, large-scale, replicated intervention studies or randomized controlled trials. For most disorders, it will be necessary to consider evidence that is a compilation of less robust data, such as modest-sized open-label clinical studies for evaluating treatment, and to extrapolate from population-based observational studies, as available, when evaluating tests.

This approach involves creating a chain of evidence, beginning with what is known about the condition and then moving to evaluate the technical performance of the test, or analytic validity. Next, evidence regarding the strength of the association between the test result and the condition of interest (the test's ability to accurately and reliably identify or predict the disorder or outcome of interest, or clinical validity) must be evaluated. Finally, the Advisory Committee must evaluate the test's clinical utility: the efficacy and effectiveness of the test in directing the management of newborns and children, balancing important health outcomes (benefits) against the harms of identification and treatment. This will involve considering whether there is adequate evidence regarding the effectiveness of treatment for the condition, along with adequate consideration of potential harm. While this approach is similar to those used in other evidence-based recommendation processes, we recognize that allowances must be made for the evaluation of rare disorders, which is unique in that we may not understand the clinical significance of test results, the phenotypic expression of detected genotypes, or the full range of potentially effective medical or other management options.

Key Questions

The chain of indirect evidence is put together with a set of key questions flowing from an analytic framework that conceptualizes the issue under study. These questions guide the literature search, help organize the systematic evidence review, and provide a road map for the Advisory Committee in translating the evidence into a recommendation. The Advisory Committee will need to review the SER and determine whether the evidence for each key question is convincing, adequate, or inadequate. Figure 1 is a generic analytic framework for use by the Advisory Committee.
The numbers in the figure correspond to the key questions discussed below.

[Figure 1. Analytic Framework: a generic framework tracing the general population of newborns through testing for the condition and identification of the condition to treatment of the condition and, ultimately, mortality, morbidity, and other outcomes. The numbered elements correspond to the key questions, including harms of testing/identification and harms of treatment/other interventions.]

Key question 1: Is there direct evidence that screening for the condition at birth leads to improved health outcomes? (overarching question)

As mentioned previously, the level of evidence needed to support the overarching question involves controlled intervention trials of screen-detected individuals. Again, for many conditions considered by the Advisory Committee, it is unlikely that direct evidence will be available. The remaining key questions allow for the development of a chain of indirect evidence that, if adequately addressed by research, can be used to support a recommendation.

Key question 2: What is known about the condition? Is the condition well-defined and important? What is the incidence of the condition in the U.S. population? What is the spectrum of disease for the condition? What is the natural history of the condition, including the impact of recognition and treatment?

How well and how specifically the condition is defined is an essential piece of information to guide the rest of the evidence review and the consideration of any recommendation. Sufficient importance can be judged by considering both the incidence of the condition and the severity of its health impact, such that a condition of lower severity can be important due to a high incidence, and a rare condition can be important due to serious health consequences. Understanding the spectrum of disease is essential in considering whether there are cases of the condition for which treatment is not effective or otherwise unwarranted, which also relates to the natural history of the condition.

Key question 3: Is there a test for the condition with sufficient analytic utility and validity?

Analytic utility involves the choice of the testing target or targets, the choice of testing platform, and the availability of and access to testing reagents, considering issues such as whether these are commercially available, custom synthesized, "home-brewed", and/or part of current research, and whether they have right to use (RTU) clearance. Analytic validity refers to the technical, laboratory accuracy of the test in measuring what it is intended to measure. It must be distinguished from clinical validity, which is the test's ability to predict the development of clinical disease. For example, tandem mass spectrometry (TMS) testing may result in a pattern of acylcarnitines and/or amino acids that is associated with a certain condition. Analytic validity would deal with the sensitivity and specificity of the TMS testing protocol in accurately and reliably detecting that pattern. Types of evidence for analytic validity are different from those for clinical validity and need to address pre-analytic, analytic, and post-analytic issues. Pre-analytic issues to evaluate include sample and reagent stability.
Consideration of the analytic phase involves evaluating accuracy (including method comparison), precision (both inter- and intra-assay), recovery, linearity, carry-over (if applicable), detection limits, signal suppression (if applicable, especially for MS/MS), intensity criteria (signal/noise), age- and gender-matched reference values (if applicable), disease range, and the cutoff level defining clinical significance (required for a second-tier test). Post-analytic issues to consider include evaluation of the interpretive guidelines used to define a case, the spectrum of differential diagnoses, and the algorithm for short-term follow-up/confirmatory testing (biochemical, in vitro, and/or molecular).

The Advisory Committee will use explicit criteria for judging this evidence as adequate (acceptable quality and sufficient number of studies) or inadequate (too much uncertainty regarding the analytic validity). A detailed description of evaluating analytic validity, developed in part by the EGAPP (Evaluating Genomic Applications in Practice and Prevention) Working Group (EWG) of the Centers for Disease Control and Prevention (CDC), is presented in Appendix A and can serve as a starting point for discussion. It is difficult to determine in isolation what level of analytic validity should be considered sufficient, as the ramifications of errors in analytic validity are seen when evaluating clinical validity. However, analytic validity is key to the dissemination of the test. The goal, of course, would be to have very high analytic sensitivity and specificity and a high level of certainty that testing programs across the country would be able to implement use of this test with the same level of analytic validity. It is possible that evidence on clinical validity will be adequate while evidence on analytic validity is not available or is otherwise inadequate. In such cases, it may still be acceptable for the Advisory Committee to make a positive recommendation to add the condition to the core set, though issues of dissemination and implementation will need to be carefully considered.

Key question 4: Does the test accurately and reliably detect the condition and clinical disease?

This question addresses the test's clinical validity: the ability of the test to accurately predict the development of symptomatic or clinical disease. Clinical sensitivity and specificity drive both false positives, which carry certain risks, and false negatives, which would then be detected later if and when the condition became symptomatic. Key metrics to consider for clinical validity include the sensitivity, specificity, positive predictive value, and false positive rate. There are two parts to this key question. The first is whether the evidence is sufficient to conclude that we know what the clinical validity is; this involves only a consideration of the strength and quality (taken together as adequacy) of the evidence in the SER to determine that we know the sensitivity and specificity of the test. The second part is whether this level of clinical validity is sufficient to justify testing, given the ability of the test to detect a reasonable number of affected individuals who would be expected to manifest clinical disease, the tradeoff of risks of false positives, and the benefits of early detection of true positives. These issues relate to both test performance and the incidence/prevalence of the condition, as the sketch below illustrates.
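To make the interaction between test performance and prevalence concrete, the minimal sketch below computes positive and negative predictive values from clinical sensitivity, clinical specificity, and prevalence (the relationship noted in Appendix B). All parameter values are invented for illustration and do not describe any nominated condition.

```python
# Hypothetical illustration: predictive values of a newborn screening test.
# Every input below is an assumed placeholder, not a reviewed estimate.

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from clinical sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence                # true positives per newborn
    fp = (1 - specificity) * (1 - prevalence)    # false positives per newborn
    fn = (1 - sensitivity) * prevalence          # false negatives per newborn
    tn = specificity * (1 - prevalence)          # true negatives per newborn
    return tp / (tp + fp), tn / (tn + fn)

# A rare disorder (1 in 25,000 births) screened with a very good test:
ppv, npv = predictive_values(sensitivity=0.99, specificity=0.999,
                             prevalence=1 / 25_000)
print(f"PPV = {ppv:.1%}, NPV = {npv:.5%}")
# PPV comes out near 4%: even at 99.9% specificity, false positives far
# outnumber true positives when the condition is this rare, which is why
# clinical validity cannot be judged on sensitivity and specificity alone.
```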
Consideration must also be given to the potential for individuals to test positive but never develop clinical disease. Issues of trade-offs between false positives, false negatives, and identification of non-clinical conditions all impact clinical utility. A detailed description of evaluating clinical validity, modified from the in-press article on the EWG methods, is presented in Appendix B.

Key question 5: Are there available treatments for the condition that improve important health outcomes? Does treatment of the condition detected through NBS improve important health outcomes when compared with waiting until clinical detection? Are there subsets of affected children more likely to benefit from treatment that can be identified through testing or clinical findings? Are the treatments for affected children standardized, widely available, and, if appropriate, FDA approved?

This question refers to clinical utility, or the ability of testing for the condition to translate into improvements in important health outcomes, and to whether the potential benefits of testing, diagnosis, and treatment exceed the potential harms. It involves evaluating whether there are treatments available and the effectiveness of treatment when provided for those in whom the condition would become clinically manifest, or provided in order to decrease the risk of developing clinical disease. It is important to note that treatment may include a broad list of interventions, including counseling and support services, beyond the narrow definition of medical therapy. To address this question, the Advisory Committee will need to determine the value of the proposed health outcomes considered. The EWG is in the process of publishing a paper on health outcomes for consideration in evidence-based recommendations for genomic tests. This list is referenced in the Secretary's Advisory Committee on Genetics, Health, and Society (SACGHS) report, U.S. System of Oversight of Genetic Testing: A Response to the Charge of the Secretary of Health and Human Services (see table below). These outcomes are not of equal weight or value, and it is likely that a good deal of debate in Advisory Committee deliberations regarding clinical utility will involve balancing the tradeoffs between different favorable and unfavorable outcomes. A detailed description of evaluating clinical utility, modified from the in-press article on the EWG methods, is presented in Appendix C.

Key questions 6 and 7: Are there harms or risks identified for the identification and/or treatment of affected children?

These questions are often incompletely addressed in medical research, yet they are key to allowing the Advisory Committee to balance the potential benefits and risks when making a recommendation regarding a condition. Harms include direct harms to physical health as well as other issues, including labeling, anxiety, adverse impacts on parent and family relationships, and other ethical, legal, and social implications. At times, the Advisory Committee may need to estimate the degree or "upper bounds" of potential harm to support decisions regarding the net benefit of testing for a condition.

Key question 8: What is the estimated cost-effectiveness of testing for the condition?

This question does not appear in the analytic framework diagram, but it is a consideration in which the Advisory Committee is specifically interested; a minimal modeling sketch follows below.
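Because, as the next paragraph notes, this question is usually answered through decision modeling rather than primary data collection, the hedged sketch below shows the basic mechanics of such a model. Every parameter value is a hypothetical placeholder chosen only to demonstrate the arithmetic, not an estimate for any real condition.

```python
# Hypothetical decision-model sketch for screening cost-effectiveness.
# All inputs are invented placeholders for illustration only.

BIRTHS = 100_000                  # annual birth cohort to be screened
PREVALENCE = 1 / 25_000           # affected births (assumed)
TEST_COST = 2.00                  # incremental cost per newborn screened
FOLLOWUP_COST = 500.00            # confirmatory work-up per screen positive
FALSE_POS_RATE = 0.001            # screen positives among unaffected newborns
EARLY_TX_COST = 20_000.00         # lifetime treatment cost, early detection
LATE_TX_COST = 60_000.00          # lifetime cost after clinical presentation
LIFE_YEARS_GAINED = 8.0           # per affected child detected early (assumed)

affected = BIRTHS * PREVALENCE
false_pos = (BIRTHS - affected) * FALSE_POS_RATE

cost_screen = (BIRTHS * TEST_COST
               + (affected + false_pos) * FOLLOWUP_COST
               + affected * EARLY_TX_COST)
cost_no_screen = affected * LATE_TX_COST

incremental_cost = cost_screen - cost_no_screen
life_years = affected * LIFE_YEARS_GAINED
print(f"Incremental cost per life-year gained: "
      f"${incremental_cost / life_years:,.0f}")
```

Under these invented inputs the model yields roughly $2,900 per life-year gained; real models would add discounting, uncertainty analysis, and sensitivity analyses over each parameter.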
There is little published empirical research on the cost-effectiveness of any health care service, and we would not expect to find studies involving primary data collection on newborn screening cost-effectiveness. Instead, this question may be addressed in the literature through decision modeling, which can provide estimates that the Advisory Committee will take into consideration when adopting a recommendation.

Translating evidence into recommendation categories for Advisory Committee reports

Based on the evidence report, an assessment of the strength and quality of the available evidence, and consideration of other clinical and social contextual issues, the Advisory Committee will make recommendations to the Secretary of Health and Human Services regarding whether conditions should be added to the core set of those recommended for newborn screening. The information is intended to provide transparent, authoritative advice. It may also be used to promote specific research to fill in gaps in the evidence for specific conditions. There are three elements to consider in making the recommendation: magnitude of net benefit, overall adequacy of evidence, and certainty of net benefit/harm.

Magnitude of net benefit

Essential factors for the development of a recommendation include:
- the relative importance of the outcomes considered;
- the health benefits associated with testing for the condition and subsequent interventions (or, if health benefits are not available from the literature, the maximum potential benefits);
- the harms associated with testing for the condition, such as adverse clinical outcomes, increases in risk, and unintended ethical, legal, and/or social issues that result from testing and subsequent interventions (or, if harms are not available from the literature, the maximum potential harms); and
- the efficacy and effectiveness of testing for the condition and follow-up compared to current practice, which might include doing nothing.

Benefits and harms may include psychosocial, familial, and social outcomes. Simple decision models or outcomes tables (such as the modeling sketch above) might be helpful in assessing the magnitudes of benefits and harms and in estimating the net effect. Consistent with the processes of other evidence-based recommendation groups, the magnitude of net benefit (benefit minus harm) can be classified as at least moderate, small, or zero/net harm. For the purposes of the Advisory Committee in making recommendations, moderate or greater net benefit will be considered "significant" and will support a recommendation to add the condition, and zero/harmful net benefit will support a recommendation to not add the condition. Those conditions where the magnitude of net benefit is classified as small will be discussed on a case-by-case basis and classified as either significant or not significant. A recommendation to add a condition where testing is expected to provide only a small net benefit should be supported by a high degree of certainty based on the evidence (see certainty of net benefit below).

Overall adequacy of evidence

The adequacy of the evidence to answer the key questions can be summarized and classified across the questions as adequate or inadequate (using explicit criteria). This is also referred to as assessing the strength of the linkages in the chain of evidence. Adequate evidence requires studies of fair or better quality addressing, at a minimum, clinical utility to support a recommendation. Inadequate evidence would include no evidence, studies of poor quality, or studies with conflicting results.
There are six critical appraisal questions that should be used to determine the adequacy of the evidence for each key question:

1. Do the studies have the appropriate research design to answer the key question?
2. To what extent are the studies of high quality (internal validity)?
3. To what extent are the studies generalizable to the U.S. population (external validity)?
4. How many studies have been done, and how large are they (precision of the evidence)?
5. How consistent are the studies?
6. Are there additional factors supporting the conclusions?

For the evidence to be judged adequate, most if not all of these questions must be answered satisfactorily.

Certainty of net benefit

Based on the summaries of the evidence for each key question and the evidence chain, the certainty of the conclusions regarding the net benefit can be classified as sufficient or insufficient. A conclusion to either recommend adding or not adding the condition with sufficient certainty has an acceptable risk or level of comfort of "being wrong" and thus a low susceptibility to being overturned or otherwise altered by additional research. Insufficient certainty should not lead to a recommendation for or against adding the condition, but should lead to a recommendation for further research.

Finally, there are likely to be conditions where the evidence is inadequate to reach a conclusion and make a recommendation based on at least fair evidence of clinical utility and significant net benefit, but contextual issues support a recommendation to add the condition with a commitment to fill in the gaps in evidence as experience with the test is gained. We recognize that such recommendations do not meet the strict criteria of evidence-based recommendations as generally accepted, but are "evidence-informed" or "evidence-supported". Contextual issues might include known benefits associated with testing (and intervention) for similar conditions, a high incidence that would translate to potentially substantial net benefit, the availability of promising but as yet unproven new therapies, or indirect evidence of perhaps lower-value health outcomes coupled with evidence of low potential harm. These conditions will be recommended with "provisional" status. Conditions added with provisional status should be re-evaluated at a time when sufficient numbers of tests have been performed such that observational data may be available to fill in the gaps in the evidence chain. This amount of time will depend on the incidence of the condition in the populations tested. Similarly, population-based pilot studies should be developed and implemented in order to answer specific evidence gaps. These pilots must be applicable to U.S. populations. The decision whether to recommend a test provisionally or to refer it for pilot studies should be made with careful consideration of the potential harms associated with the premature acceptance of unproven clinical strategies, weighed against the potential health benefits and the potential harms of waiting for more compelling evidence.

Recommendations will be based on the level of certainty that testing will result in a significant net health benefit, based on the evaluation of the evidence. The following matrix will guide the recommendation category.
Table 1: Decision Matrix for Advisory Committee Recommendations

Recommendation: Recommend adding the test to the core set
  Level of certainty: Sufficient
  Magnitude of net benefit: Significant

Recommendation: Recommend not adding the test to the core set
  Level of certainty: Sufficient
  Magnitude of net benefit: Zero or net harm

Recommendation: Recommend adding the test with "provisional" status
  Level of certainty: Insufficient, but the potential for net benefit is compelling enough to add the test now, with a commitment to evaluate the experience with the test over time
  Magnitude of net benefit: Potentially significant, supported by contextual considerations

Recommendation: Recommend not adding the test now, but instead recommend pilot studies
  Level of certainty: Insufficient, and additional evidence is needed to make a conclusion about net benefit
  Magnitude of net benefit: Potentially significant or unknown

APPENDIX A
Analytic Validity

The analytic validity of a newborn screening test is its ability to accurately and reliably measure, in the clinical laboratory and in specimens representative of the population of interest, the presence of the specific pattern of acylcarnitines and/or amino acids (in the case of tandem mass spectrometry) or the specific genetic mutation of interest. Analytic validity includes analytic sensitivity (detection rate), analytic specificity (1 minus the false positive rate), reliability (e.g., repeatability of test results), and assay robustness (e.g., resistance to small changes in pre-analytic or analytic variables). Errors that affect analytic validity can occur throughout the testing process, and are categorized as pre-analytic, analytic, and post-analytic. Pre-analytic errors are related to samples (e.g., wrong sample type, insufficient amount, sample mislabeled at the source), sample handling (e.g., transport temperature, time in transport, mix-up/mislabeling in the laboratory), and data entry. Post-analytic errors are generally related to transcription/data entry of results and laboratory reports that contain incorrect or confusing interpretations. It has been estimated that pre- and post-analytic variables are the biggest contributors to laboratory mistakes, accounting for at least two-thirds of all errors. Studies performed on specimens that do not represent routinely analyzed clinical specimens, and that are not subject to all aspects of the routine testing process (e.g., sample collection, transport, processing), are not sufficient for generalizable characterization of analytic validity.

Test kits or reagents that have been cleared or approved by the Food and Drug Administration (FDA) may provide information on analytic validity that is publicly available for review (e.g., 510(k) summaries). However, a large proportion of available tests are offered as laboratory developed tests (LDTs) and have not been reviewed by the FDA. Consequently, information from other sources must be sought and evaluated. Different tests may use a similar methodology (such as TMS), and information regarding the analytic validity of a common technology may be informative. However, general information about the technology cannot be used as a substitute for specific information about the test under review. Below is a hierarchy of study designs that have been (or could be) used to obtain unbiased and reliable information about analytic validity, providing a quality ranking of data sources. The best information would come from collaborative studies using a single large, carefully selected panel of well-characterized control samples that are blindly tested and reported, with the results independently analyzed.
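To make concrete the summary statistics that such a control-panel study yields (and that the quality criteria below call for), here is a minimal sketch computing point estimates of analytic sensitivity and specificity with 95% confidence intervals. The counts are invented and do not come from any actual proficiency or validation study.

```python
# Hypothetical blinded control-panel results (all counts invented):
# 198 of 200 known-positive samples detected, and 996 of 1000
# known-negative samples correctly reported as negative.
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for label, hits, n in [("analytic sensitivity", 198, 200),
                       ("analytic specificity", 996, 1000)]:
    lo, hi = wilson_ci(hits, n)
    print(f"{label}: {hits / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The width of these intervals is driven by the number of challenges, which is why the grading criteria below emphasize a sufficient number and distribution of challenges.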
Data from proficiency testing schemes (Levels 1 or 2) can provide information about all three phases of analytic validity (i.e., pre-analytic, analytic, and post-analytic) and about inter-laboratory variability.

Hierarchy of study designs/data sources

Level 1
- Collaborative study using a large panel of well-characterized samples
- Summary data from well-designed external proficiency testing schemes or inter-laboratory comparison programs

Level 2
- Other data from proficiency testing schemes
- Well-designed peer-reviewed studies (e.g., method comparisons, validation studies)
- Expert panel reviewed FDA summaries

Level 3
- Less well designed peer-reviewed studies

Level 4
- Unpublished and/or non-peer reviewed research, clinical laboratory, or manufacturer data
- Studies on the performance of the same basic methodology, but used to test for a different target

The list below presents criteria for assessing the quality of individual studies on analytic validity. The quantity of data includes the number of reports, the total number of positive and negative controls studied, and the range of methodologies represented. The consistency of findings can be assessed formally (e.g., by testing for homogeneity) or, when sufficient data are lacking, by less formal methods (e.g., providing a central estimate and range of values). One or more internally valid studies do not necessarily provide sufficient information to justify routine clinical use. Supporting the use of a test in routine clinical practice generally requires studies that provide estimates of analytic validity that are generalizable to diverse "real world" settings. Also, existing data may support the reliable performance of one methodology while no data are available to assess the performance of one or more other methodologies.

Criteria for evaluating study quality

Adequate descriptions of the test under evaluation (index test)
- Source and inclusion of positive and negative control materials
- Reproducibility of test results
- Quality control/assurance measures
- Specific methods evaluated
- Number of positive samples and negative controls tested

Adequate descriptions of the basis for the "right answer"
- Comparison to a "gold standard" referent test
- Consensus (e.g., external proficiency testing)
- Characterized control materials (e.g., NIST, sequenced)

Avoidance of biases
- Blinded testing and interpretation
- Specimens represent routinely analyzed clinical specimens in all aspects (e.g., collection, transport, processing)
- Reporting of test failures and uninterpretable or indeterminate results

Analysis of data
- Point estimates of analytic sensitivity and specificity with 95% confidence intervals (as in the sketch above)
- Sample size / power calculations addressed

Finally, the evidence must be examined overall and a decision made regarding whether it is graded as convincing, adequate, or inadequate. When the quality of evidence is Convincing, the observed estimate or effect is likely to be real, rather than explained by flawed study methodology, and the conclusion is unlikely to be strongly affected by the results of future studies. When the quality of evidence is Adequate, the observed results may be influenced by flaws in study methodology and, as more information becomes available, the estimate or effect may change enough to alter the conclusion.
When the quality of evidence is Inadequate, the observed results are more likely to be the result of flaws in study methodology than an accurate assessment, and subsequent information is more likely to change the estimate or effect enough to change the conclusion. Availability of only marginal-quality studies always results in a grade of Inadequate. The criteria for grading evidence for analytic validity are presented below.

Evidence grading for analytic validity

Convincing evidence (studies that provide confident estimates of analytic sensitivity and specificity using intended sample types from representative populations):
- Two or more Level 1 or 2 studies that are generalizable, have a sufficient number and distribution of challenges, and report consistent results
- One Level 1 or 2 study that is generalizable and has an appropriate number and distribution of challenges

Adequate evidence:
- Two or more Level 1 or 2 studies that lack the appropriate number and/or distribution of challenges, or that are consistent but not generalizable
- Modeling showing that lower quality (Level 3 or 4) studies may be acceptable for a specific, well-defined clinical scenario

Inadequate evidence:
- Combinations of higher quality studies that show important unexplained inconsistencies
- One or more lower quality studies (Level 3 or 4)
- Expert opinion

APPENDIX B
Clinical Validity

The clinical validity of a newborn screening test may be defined as its ability to accurately and reliably predict the clinically defined disorder of interest. Clinical validity encompasses clinical sensitivity and specificity, and the disorder prevalence (the proportion of individuals in the selected setting who have, or will develop, the clinical disorder of interest). The positive and negative predictive values can be computed from the clinical sensitivity, clinical specificity, and prevalence (as in the sketch accompanying key question 4 above). Other variables important to clinical validity are penetrance (usually associated with genetic testing, this is the proportion of individuals with a specific genotype who manifest the specific associated phenotype; there is a similar construct for TMS patterns), expressivity (the extent to which a specific phenotype is expressed in individuals with the associated genotype, or a disorder is expressed in an individual with the associated condition defined by the TMS pattern), and the genetic and environmental factors that may impact the disorder or the tests.

As with analytic validity, the important characteristics defining the overall quality of evidence on clinical validity are the internal validity of individual studies, the number of studies, the representativeness of the study population(s) compared to the population(s) to be tested, and the consistency and generalizability of the findings. The list below provides a hierarchy of study designs for assessing clinical validity.

Hierarchy of study designs/data sources

Level 1
- Well-designed longitudinal cohort studies
- Validated clinical decision rule*

Level 2
- Well-designed case-control studies

Level 3
- Lower quality case-control and cross-sectional studies
- Unvalidated clinical decision rule*

Level 4
- Case series
- Unpublished and/or non-peer reviewed research, clinical laboratory, or manufacturer data
- Consensus guidelines
- Expert opinion

*A clinical decision rule is an algorithm leading to result categorization.
It can also be defined as a clinical tool that quantifies the contributions made by different variables (e.g., test result, family history) in order to determine the classification/interpretation of a test result (e.g., for diagnosis, prognosis, or therapeutic response) in situations requiring complex decision-making.

The list below provides the criteria adopted for grading the internal validity of studies (e.g., study design, execution, minimizing bias). The quantity of data includes the number of studies or the number and racial/ethnic distribution of total subjects in the studies. The overall consistency of clinical validity estimates can be determined by formal methods such as meta-analysis. Minimally, estimates of clinical sensitivity and specificity should include confidence intervals. In most instances, estimates of clinical validity will be computed from small datasets focused on individuals with the disease, or from case/control studies which may, or may not, represent the wide range or frequency of results that will be found in the general population. However, when tests are to be widely applied (e.g., for screening), additional data may be needed from the general population to better quantify clinical validity prior to introduction.

Criteria for assessing study quality

- Clear description of the disorder/phenotype and outcomes of interest
- Status verified for all cases
- Appropriate verification of controls (verification does not rely on the index test result)
- Prevalence estimates are provided
- Adequate description of study design and test/methodology
- Adequate description of the study population
  - Inclusion/exclusion criteria
  - Sample size, demographics
  - Study population defined and representative of the clinical population to be tested
  - Allele/genotype frequencies or analyte distributions known in the general population and subpopulations
- Independent blind comparison with appropriate, credible reference standard(s)
  - Independent of the test
  - Used regardless of test results
- Description of the handling of indeterminate results and outliers
- Blinded testing and interpretation of results
- Analysis of data
  - Possible biases are identified and their potential impact discussed
  - Point estimates of clinical sensitivity and specificity with 95% confidence intervals
  - Estimates of positive and negative predictive values

Finally, the evidence must be examined overall and a decision made regarding whether it is graded as convincing, adequate, or inadequate (see Appendix A). The criteria for grading evidence of clinical validity are presented below.
Evidence grading for clinical validity

Convincing evidence (well-designed and conducted studies in representative population(s) that measure the strength of association between a genotype or biomarker and a specific, well-defined disease or phenotype):
- Systematic review/meta-analysis of Level 1 studies with homogeneity
- Validated clinical decision rule (CDR)
- High quality Level 1 cohort study

Adequate evidence:
- Systematic review of lower quality studies
- Review of Level 1 or 2 studies with heterogeneity
- Case/control study with good reference standards
- Unvalidated CDR (Level 2)

Inadequate evidence:
- Single case-control study with non-consecutive cases or lacking consistently applied reference standards
- Single Level 2 or 3 cohort/case-control study in which the reference standard was defined by the test or not used systematically, or in which the study was not blinded
- Level 4 data

APPENDIX C
Clinical Utility

The clinical utility of a newborn screening test refers to evidence of improved, measurable clinical outcomes, and to its usefulness and added value for patient management decision-making compared with current management without testing. If a test has utility, it means that the results (positive or negative) provide information that is of value to the person (or sometimes the individual's family or community) in seeking an effective treatment or preventive strategy. Clinical utility encompasses effectiveness (evidence of utility in real clinical settings) and the net benefit (the balance of benefits and harms). Frequently, it also involves assessment of efficacy (evidence of utility in controlled settings). As was the case with analytic and clinical validity, the three important quality characteristics for clinical utility are the quality of individual studies and of the overall body of evidence, the quantity of relevant data, and the consistency and generalizability of the findings. The lists below provide the hierarchy of study designs for clinical utility, and the criteria for grading the internal validity of studies (e.g., study design, execution, minimizing bias), adopted from other published approaches.

Hierarchy of study designs/data sources

Level 1
- Meta-analysis of randomized controlled trials (RCTs)

Level 2
- A single randomized controlled trial

Level 3
- Controlled trial without randomization
- Cohort or case-control study

Level 4
- Case series
- Unpublished and/or non-peer reviewed studies
- Clinical laboratory or manufacturer data
- Consensus guidelines
- Expert opinion

Criteria for assessing study quality

Clear description of the outcomes of interest
- What was the relative importance of the outcomes measured; which were prespecified primary outcomes and which were secondary?

Clear presentation of the study design
- Was there a clear definition of the specific outcomes or decision options to be studied (clinical and other endpoints)?
- Was interpretation of outcomes/endpoints blinded?
- Were negative results verified?
- Was data collection prospective or retrospective?
- If an experimental study design was used, were subjects randomized?
- Were the intervention and the evaluation of outcomes blinded?
- Did the study include comparison with current practice/empirical treatment (value added)?

Intervention
- What interventions were used?
- What were the criteria for the use of the interventions?

Analysis of data
- Is the information provided sufficient to rate the quality of the studies?
- Are the data relevant to each outcome identified?
- Is the analysis or modeling explicit and understandable?
- Are analytic methods pre-specified, adequately described, and appropriate for the study design?
- Were losses to follow-up and the resulting potential for bias accounted for?
- Is there assessment of other sources of bias and confounding?
- Are there point estimates of impact with 95% confidence intervals? (A computational sketch follows at the end of this appendix.)
- Is the analysis adequate for the proposed use?

Finally, the evidence must be examined overall and a decision made regarding whether it is graded as convincing, adequate, or inadequate (see Appendix A). The criteria for grading evidence of clinical utility are presented below.

Grading evidence for clinical utility

Convincing evidence (well-designed and conducted studies in representative population(s) that assess specified health outcomes):
- Systematic review/meta-analysis of RCTs showing consistency in results
- At least one large RCT (Level 2)

Adequate evidence:
- Systematic review with heterogeneity
- One or more controlled trials without randomization (Level 3)
- Systematic review of Level 3 cohort studies with consistent results

Inadequate evidence:
- Systematic review of Level 3 quality studies or studies with heterogeneity
- Single Level 3 cohort or case-control study
- Level 4 data
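To make the "point estimates of impact with 95% confidence intervals" criterion concrete, the hedged sketch below computes a risk difference between a hypothetical screen-detected cohort and a clinically detected cohort. All counts are invented for illustration and do not describe any reviewed condition or study.

```python
# Hypothetical comparison (all counts invented): adverse outcomes among
# affected children detected by screening vs. those identified only
# after clinical presentation.
from math import sqrt

def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Risk difference (group a minus group b) with a 95% Wald CI."""
    p_a, p_b = events_a / n_a, events_b / n_b
    rd = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return rd, rd - z * se, rd + z * se

# 6/80 adverse outcomes with screening vs. 24/80 with clinical detection:
rd, lo, hi = risk_difference_ci(6, 80, 24, 80)
print(f"Risk difference = {rd:+.1%} (95% CI {lo:+.1%} to {hi:+.1%})")
# A confidence interval that excludes zero supports a real effect of
# early detection; a wide interval spanning zero would leave the
# evidence of clinical utility graded as inadequate.
```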