Biomarkism: taming the revolution?
May 12th 2014, PSI Conference
David Lovell, St George's Medical School, University of London
[Portraits: Henry Gray; Edward Jenner]

Plan of contribution
• Different types of biomarkers
• Biomarkers (the journal) and statistical guidelines
• Some personal editorial/refereeing experiences
• Challenges
• Epigenetics
• Discussion points

Editorial Board Member
• Biomarkers
• Mutagenesis
• Toxicology in Vitro (until 2008)
• Refereeing for numerous journals (20+) over the last 10-15 years

Biomarkers was started in 1996 by John Timbrell (UoL School of Pharmacy). The journal Biomarkers brings together all aspects of the rapidly growing field of biomarker research, encompassing their various uses and applications in one essential source. Manuscripts can describe biomarkers measured in humans or other animals in vivo or in vitro.

FDA definition of a biological marker (biomarker): "A characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention." ("Biomarkers and Surrogate Endpoints: Preferred Definitions and Conceptual Framework", Biomarkers Definitions Working Group, 2001)

• Biomarkers of exposure: covering detection and measurement of internal exposure to drugs and other chemicals;
• Biomarkers of response: including measures of endogenous substances or parameters indicative of pathological or biochemical changes, both toxicodynamic and pharmacodynamic, resulting from exposure to drugs and other chemicals;
• Biomarkers of susceptibility: including genetic factors which alter susceptibility to drugs and other chemicals;
• Biomarkers of disease: covering measurement of endogenous substances or parameters indicative of a disease process, and the use of pharmacodynamic and genetic markers in evidence-based laboratory medicine and treatment (markers of efficacy)

• 13 (at least) other journals with Biomarkers in the title
• 8 issues/year
• >400 papers received in 2012 and 2013
• Approx. 225 referees in 2013
• Rejection rates have increased from 51% in 2009 to 73% in 2013
• Year-average impact factor stable at 2.230

Geographical spread (submissions by country and year):

Country    2009  2010  2011  2012  2013
U.S.         54    54    32    36    16
China        33    44    54   108   122
India        25    27    23    35    26
Italy        20    12    17    22    17
U.K.         14    22    14    11    16
Germany      13    20    20     8     5
Brazil        9     9     9     8    15
Turkey        5     7     7    27    24

The paper
"Biomarkers expects high standards but recognizes that it needs to be vigilant as scientific research continues to be affected by errors in the conduct and reporting of research and by fraudulent research. There have been reports of the high incidence of statistical errors, poor statistical practice, and limitations in the designs used in papers published by peer-review journals."

Quotes from the paper
Biomarkers, therefore, starts from the position that it is of paramount importance that studies published in a peer-reviewed journal should have been correctly designed, carried out, and reported, and that the results are provided in such a way that the experimental and statistical methods could be repeated. This is also important for both economic and ethical considerations. Transparency in terms of the availability of and access to the original raw data is a key component for the critical assessment of evidence-based research. Biomarkers, at present, does not have statistical guidelines. It does, though, have instructions to authors, which provide sensible general requirements.
Table 1 provides a non-exhaustive list of some of the guidelines available. In many specific technical areas, such as the -omics, technical reviews of appropriate statistical methods have been produced; the referee can reasonably expect that an author is aware of them, applies the methods, and cites the publications where appropriate. Biomarkers, therefore, expects to see evidence of the planning that went into a study and to see statistical analyses which make full use of the design. Examples would be details of the statistical analysis plan (SAP), consideration of the primary endpoint, and whether the primary aim of the study was hypothesis testing or hypothesis generation. A failure to declare in the "methods" section that blinding and randomization were carried out would be interpreted as implying that this had not been done. Details, such as the type of randomization (e.g., block or stratified) and the methods used for blinding, should be given when relevant. Biomarkers expects to see a justification of the sample sizes used and, where relevant, the power calculations which were carried out as part of the development of the SAP, for both experimental and observational studies. Uncritical use of hypothesis testing, and the reporting of results simply as statistically significant (p < 0.05) rather than, preferably, with the exact p values, is not acceptable. Statistical significance alone is not a justification for publication. It is, therefore, important to note that Biomarkers' policy is that well-designed studies which produce negative results are viewed favourably for publication. This policy also meets the ICMJE (2008) obligation to publish negative studies.
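As an illustration of the kind of a priori sample-size justification a referee might expect to find documented in an SAP, here is a minimal sketch using statsmodels; the effect size, alpha, and target power are hypothetical placeholders, not journal requirements:

```python
# Minimal sketch of an a priori power calculation for a two-group
# comparison. The standardized effect size, alpha, and target power
# below are hypothetical values chosen for illustration only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # assumed standardized difference (Cohen's d)
    alpha=0.05,               # two-sided significance level
    power=0.80,               # target power
    alternative="two-sided",
)
print(f"Approximately {n_per_group:.0f} subjects per group are needed.")
# Roughly 64 per group under these assumptions.
```

The point is not the particular numbers but that the assumptions behind the sample size are written down in advance and can be checked.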
The Refereeing Challenge
An email arrives to the 'volunteer':
• Invitation to Review Manuscript ID xxx
• Recently, you agreed to review Manuscript ID xxx. A previous e-mail was sent to you four days ago as a reminder that your review was due. We have yet to receive your review of this manuscript.

Indication of the problem
• 42,328 journals listed in PubMed (Biomarkers is #29,757)
• Estimated 1 million+ papers in them a year?
• "A reasonably mature journal like Neuroimage would hope to see between 70% and 90% of submissions rejected."
• The Intergovernmental Panel on Climate Change (IPCC) latest report is based upon 73,000 publications (25% of them in Chinese) – a 100-fold increase in 30 years (Economist, 5/4/14)

How long does it take to referee a paper?
• Initial read-through (30-40 minutes): locate the paper in the scientific universe (main concepts/general theme); 10-20 minutes digesting, following up the legends of figures and tables to see if they link in with what I read in the text. Is this groundbreaking work or junk? Can I tell? (1 hour)
• Second read-through, two days later (1 hour): more detailed; identification of the main methods, any limitations, uncertainties, unanswered questions; linking the text (especially the conclusions) more closely to the data, analysis and results. Identify figures that don't match the text or legends. Do results seem odd? Gaining confidence in the view that the paper is uninteresting and should be rejected. First draft of the referee report (1 hour).
• Third reading, the next day (40 minutes): concentration on areas not completely understood, and identification of the key points in the referee's report that justify recommending rejection rather than resubmission. Write up the report and send it back to the editor (20 minutes); completion of the final report to the editor (1 hour).
• Total: 3 hours, plus the time taken to access websites, remember passwords, etc.

Follow on
Statistics: Guidelines are given in Lovell, D.P. (2012) Biomarkers 17(3), 193-200. In brief, Biomarkers expects authors to be aware of the appropriate statistical analyses that should be used in their specific field of research and to prepare their submissions accordingly. A statistical analysis plan should be available for the studies reported and, if required, all relevant data and analyses must be accessible to reviewers. Authors should indicate how datasets used in the analyses will be maintained and be willing to make their data available to other researchers. The author(s) responsible for statistical design and analysis must be indicated as a point of contact on the title page by the # symbol. If a statistician was employed for the analysis but is not an author, s(he) must be identified and have agreed that their name and email address will appear in the acknowledgements section.

Following on
• Citations?
• Effect on journal? From Volume 19 (2014) onwards: "# ***** **** and ****** ********* are responsible for statistical design and analysis."
• Speed of response?
• How to monitor and follow up?

Personal examples

Example 1: What goes round comes round
• Paper refereed for one journal (rank #1)
• Identical re-submission to first journal
• Asked to referee by 2nd journal. Refused.
• Asked to referee by 3rd journal. Refused.

Example 2: "You can be one of the authors"
Reviewer comments: "A statistician with experience in systematic review and meta-analysis should be consulted to assist with the analysis as the description given of the statistics and the summary statistics provided are not correct."
Authors' comments: "Reviewer clearly very upset with basic statistical analysis performed. We do not have the knowledge to perform such analysis given that the individual papers are so heterogenous. Unusual analysis required. Likely lower impact journal would have accepted our statement!"
Authors' suggested action: "Suggest contact Dr. Lovell and offer authorship in return for review of statically (sic) element of paper."
"We have got your name from Prof ****** ******* because we need some support to improve a systematic review paper on the long term issues associated with ********. Paper has been accepted with major revision in the American Journal of ********." (Impact Factor 2.516)

Example 3
"Although there is a distinct grouping of CASE samples on the left side of the map, there are many other samples that are located throughout the Control samples, indicating these samples have measures for these analytes that are more similar to Controls than CASE.
Thus these analytes are not sufficient to differentiate between all members of the 2 groups." (Report from a company bioinformatician)

[Figures: five scatterplots – PC II vs PC I coloured by Group (CASE/Control); C103 vs C3 by Group; PC II vs PC I by Batch (1/2); PC I vs Study ID by Batch; PC I vs Study ID by Group]

"As you can imagine I am slightly distraught but have been in contact with both the company and the statistician to look again at the data to quantify the effect."
"I find it interesting there are two distinct groupings unlinked to my clinical categorisation."
"I did send two lots of samples separated by about a year but this is the only difference - I wonder if they separate on ID number (1001-1050 roughly went first, 1151 onwards went second). I was expecting there to be minimal experimental variation as this is what the company promise, so this would be an important quality control issue to flag up."
"…there were 18 months between batches so entirely possible this explains the split. All other methods of collection and storage remained the same as far as I am aware."
• I am essentially going to bin these results
• Thanks for pointing it out - I at least can resend them in one go and hopefully ******* will be able to give me a discount on principle - if not the grant can take it luckily
• I am going to withdraw all the papers obviously
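The pattern in these plots, with samples separating on a principal component by processing batch rather than by clinical group, is straightforward to check for. A minimal, hypothetical sketch (Python/scikit-learn; the file name and the "group" and "batch" columns are invented for illustration):

```python
# Hypothetical sketch: use PCA to check whether samples separate by
# processing batch rather than by clinical group (CASE vs Control).
# The file name and column names ("group", "batch") are invented.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("analytes.csv")          # one row per sample
X = StandardScaler().fit_transform(df.drop(columns=["group", "batch"]))

pcs = PCA(n_components=2).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, label in zip(axes, ["group", "batch"]):
    for level in df[label].unique():
        mask = (df[label] == level).to_numpy()
        ax.scatter(pcs[mask, 0], pcs[mask, 1], label=f"{label}={level}")
    ax.set_xlabel("PC I"); ax.set_ylabel("PC II"); ax.legend()
plt.show()
# If the clusters in the batch panel are cleaner than in the group
# panel, a batch effect is the likely explanation for the split.
```

A few minutes of plotting like this, before any formal analysis, is exactly the kind of quality-control step that would have flagged the problem above.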
Challenges and solutions?

Fraud and forensic bioinformatics: Potti et al and Duke University (http://arxiv.org/pdf/1010.1092.pdf)
After thousands of hours of investigation, three clinical trials at Duke University in Durham, North Carolina, were suspended in late 2009 because of the irreproducibility of the genomic 'signatures' used to select cancer therapies for patients. Journals have a duty to help the community by maintaining reproducibility as a cornerstone of the scientific process.
"They also noted that the internal committees responsible for protecting patients and overseeing clinical trials lacked the expertise to review the complex, statistics-heavy methods and data produced by experiments involving gene expression."
"That is a theme the investigating committee has heard repeatedly. The process of peer review relies (as it always has done) on the goodwill of workers in the field, who have jobs of their own and frequently cannot spend the time needed to check other people's papers in a suitably thorough manner. (Dr McShane estimates she spent 300-400 hours reviewing the Duke work, while Drs Baggerly and Coombes estimate they have spent nearly 2,000 hours.) Moreover, the methods sections of papers are supposed to provide enough information for others to replicate an experiment, but often do not. Dodgy work will out eventually, as it is found not to fit in with other, more reliable discoveries. But that all takes time and money." (Economist, Sep 10th 2011, http://www.economist.com/node/21528593)

Challenges to guideline approaches
• Academic science is better than GLP science?
• "But scientific reform is needed as well. For decades, regulatory bodies have relied on guideline studies conducted under national and internationally agreed standards known as Good Laboratory Practice (GLP). This governs how the studies are planned, performed, monitored, recorded, reported and archived. These standards are invaluable, providing a guarantee of reliability and cross-comparability for studies on chemical safety. But the glacial pace of consensus building and validation required to update guidelines can leave gaping holes that allow the approval of chemicals of questionable safety." (http://www.nature.com/nature/journal/v464/n7292/full/4641103b.html)
• "Moreover, detecting BPA's effects generally requires cutting-edge biological techniques whose results, in the eyes of regulatory bodies, carry just a fraction of the weight of those produced by a GLP study."

Séralini et al (2012) paper
"The Editor-in-Chief again commends the corresponding author for his willingness and openness in participating in this dialog. The retraction is only on the inconclusiveness of this one paper. The journal's editorial policy will continue to review all manuscripts no matter how controversial they may be. The editorial board will continue to use this case as a reminder to be as diligent as possible in the peer review process."
"Ultimately, the results presented (while not incorrect) are inconclusive, and therefore do not reach the threshold of publication for Food and Chemical Toxicology. The peer review process is not perfect, but it does work. The journal is committed to getting the peer-review process right, and at times, expediency might be sacrificed for being as thorough as possible. The time-consuming nature is, at times, required in fairness to both the authors and readers. Likewise, the Letters to the Editor, both pro and con, serve as a post-publication peer-review. The back and forth between the readers and the author has a useful and valuable place in our scientific dialog." (FCT, 2014)
"Efforts to suppress scientific findings, or the appearance of such, erode the scientific integrity upon which the public trust relies. The retraction by the FCT marks a significant and destructive shift in management of the publication of controversial scientific research. Equally troublesome is that this retraction does not really impact how the science will be viewed by scientists, but only how it is viewed by others outside of the scientific community. We feel the decision to retract a published scientific work by an editor, against the desires of the authors, because it is "inconclusive" based on a post hoc analysis represents a dangerous erosion of the underpinnings of the peer-review process, and Elsevier should carefully reconsider this decision." (Portier et al (2014) Inconclusive Findings: Now You See Them, Now You Don't! Environmental Health Perspectives, volume 122, February 2014)

Genetics
(Economist, 4/1/14) Illumina this week (15/1/14) claimed to be the first company to achieve the coveted $1,000 genome.
• Genome-wide Association Studies (GWAS)
• Next Generation Sequencing (NGS)
• Analysis of exome chip, exome sequencing data and whole genome sequencing data
• Haplotype mapping, analysis of structural variation, meta-analysis and gene-environment interaction
• Qualitative differences, stable within the individual and over time (cancer/mutation etc.)

Epigenetics: Marksists
• Something revolutionary.
• Studying all the marks left in the genome that form the basis of epigenetics suggests a new type of scientist is about to appear: Marksists.

Epigenetics
• Epigenetic marks: methylation of DNA bases, histone variation
• Switch genes on or off and/or regulate them

Epigenome
• "The epigenome comprises all of the chemical compounds that have been added to the entirety of one's DNA (genome), but are not part of the DNA sequence, as a way to regulate the activity (expression) of all the genes within the genome."
• Tens of millions of methylation sites
• Pattern variable within the individual and over time
• Chip ($200): 450,000 sites
• Inter-generational and trans-generational inheritance
• Three-generation test (link to reproductive toxicity)
• Male-mediated teratogenesis
• "Sins of the grandmother"

Epigenome-wide association studies (EWAS)
• The investigation of the distribution of methyl groups at thousands of specific DNA nucleotides across the genome to identify arrangements that are common in a disease, or associated with variation in a trait.
• "The problem with EWAS is that there's so much more that can confound an outcome compared with a GWAS." (John Greally)
• Epigenetic signatures which were thought to result from ageing instead reflected the changing proportions of blood cell types with age.
• Methods for analysing chemical patterns on DNA show promise for explaining disease, but few results have yet been replicated.
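The blood cell-type problem is, at heart, a confounding-adjustment issue. A deliberately simplified sketch (Python/statsmodels; all file and column names are hypothetical, and in a real EWAS the cell proportions are usually themselves estimated from the methylation data, e.g. by reference-based deconvolution, rather than measured directly):

```python
# Simplified sketch of adjusting an EWAS association for blood
# cell-type composition. Column names are hypothetical; in practice
# the cell proportions are typically estimated from the methylation
# data itself (reference-based deconvolution).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ewas_site.csv")  # one CpG site, one row per subject

# Unadjusted model: methylation ~ age
unadj = sm.OLS(df["methylation"],
               sm.add_constant(df[["age"]])).fit()

# Adjusted model: add estimated cell-type proportions as covariates
cells = ["cd4t", "cd8t", "bcell", "nk", "mono", "gran"]
adj = sm.OLS(df["methylation"],
             sm.add_constant(df[["age"] + cells])).fit()

print(unadj.params["age"], adj.params["age"])
# If the "age" coefficient shrinks markedly after adjustment, the
# apparent ageing signature may largely reflect shifting cell mixtures.
```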
Examples
• Stressful home life associated with shorter telomeres in a group of 9-year-old boys
• Psychotherapy can alter methylation
• Post-traumatic stress disorder (PTSD): "unusual profiles"
• People abused as children differ from those abused as adults
• Patterns related to suicide, successful dieting, US social status
• Drugs can alter the epigenome
• Holocaust survivors v. those with no traumatic experience
• Hungerwinter study of the Dutch Famine Birth Cohort
• Maternal nutrition around the time of conception can affect the regulatory tagging of the child's DNA
• Marks left by smokers, ex-smokers, food, diesel fumes, pesticides; arsenic 'produces distinct patterns'
• Male mice with folate deficiencies 'reprogram sperm' (http://www.economist.com/news/science-and-technology/21591547-lack-folate-diet-male-mice-reprograms-their-sperm-ways)
• Markers associated with rodent stress in early life correlated with certain aversive behaviours; the same marks could be found in their offspring and, in some cases, in their offspring's offspring
• A methylation profile involving around 400 sites gives five years' warning of the onset of breast cancer

Skinner, M. K. (2008). What is an epigenetic transgenerational phenotype? F3 or F2. Reprod. Toxicol. 25, 2-6.
Male mice whose great grandmothers were exposed to PCB have lower sperm counts than others whose great grandmothers weren't (Poscar et al, 2013).
How do you design studies that control all the confounders over three generations?

Discussion Points
• Is peer review feasible if every paper can be published somewhere?
• Can guidelines by themselves be used to 'police' the literature?
• Is it realistically possible to review multi-author/multi-disciplinary work?
• Is it possible to ensure quality prospectively or retrospectively?
• How (or even should we be trying) to maintain quality as scientific output becomes increasingly global/multi-polar?
• Should reproducibility be a pre-condition for publication?
• 'Black boxes'?
• Bioinformatics and/or statistics?
• Use of check lists?
• Perception of statistics as a tool (technical rather than scientific)?
• "I'm a molecular biologist, we don't need statistics"

Nature Editorial, 13th February 2014:
"Too many researchers have an incomplete or outdated sense of what is necessary in statistics; this is a broader problem than misuse of the P value. Among the most common fundamental mistakes in research papers submitted to Nature, for instance, is the failure to understand the statistical difference between technical replications and independent experiments."
"Department heads, lab chiefs and senior scientists need to upgrade a good working knowledge of statistics from the 'desirable' column in job specifications to 'essential'. But that, in turn, requires universities and funders to recognize the importance of statistics and provide for it."
"Good statistics can no longer be seen as something that makes science better — it is a fundamental requirement, and one that can only grow in importance as funding cuts bite and competition for resources intensifies."

"Correctable weaknesses in the design, conduct, and analysis of biomedical and public health research studies can produce misleading results and waste valuable resources. Small effects can be difficult to distinguish from bias introduced by study design and analyses. An absence of detailed written protocols and poor documentation of research is common. Information obtained might not be useful or important, and statistical precision or power is often too low or used in a misleading way. Insufficient consideration might be given to both previous and continuing studies. Arbitrary choice of analyses and an overemphasis on random extremes might affect the reported findings. Several problems relate to the research workforce, including failure to involve experienced statisticians and methodologists, failure to train clinical researchers and laboratory scientists in research methods and design, and the involvement of stakeholders with conflicts of interest. Inadequate emphasis is placed on recording of research decisions and on reproducibility of research. Finally, reward systems incentivise quantity more than quality, and novelty more than reliability.
We propose potential solutions for these problems, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems." (Ioannidis et al (2014) Research: increasing value, reducing waste 2. Published online January 8, 2014. http://dx.doi.org/10.1016/S0140-6736(13)62227-8)

New Scientist Survey
• N = 122 (out of 1,000 stem cell researchers)
• 55% thought stem cell research is put under more pressure than other areas of biomedical science
(New Scientist, 29/3/14; http://www.newscientist.com/articleimages/mg22129623.400/1-stem-cell-scientists-reveal-unethical-work-pressures.html; http://www.newscientist.com/data/doc/article/dn25281/stemcellsurveypdf1.pdf)

The importance of transparent reporting of biomarker studies
Doug Altman, Centre for Statistics in Medicine, University of Oxford

The importance of transparent reporting
Research only has value if
– Study methods have validity
– Research findings are published in a usable form
The goal should be transparency
– Should not mislead
– Should allow replication (in principle)
– Can be included in systematic review and meta-analysis

Biomarker studies: focus on studies of prognosis
Prognosis refers to the risk of future health outcomes in individuals or groups with a given disease or health condition. The study of prognosis has never been more important – more people are living with conditions impairing health due to improvements in life expectancy. Understanding and improving prognosis is pivotal to the practice of clinical medicine. Prognostic information is increasingly used by clinicians to help manage patients.

Prognostic research themes
1) Fundamental prognosis research: the course of health-related conditions in the context of the nature and quality of current care
2) Prognostic factor research: specific factors (such as biomarkers) that are associated with prognosis
3) Prognostic model research: the development, validation, and impact of statistical models that predict individual risk of a future outcome
4) Stratified medicine research: the use of prognostic information to help tailor treatment decisions to an individual or group of individuals with similar characteristics
[Hemingway et al. BMJ 2013]

Prognostic factor research
Aims to identify factors associated with subsequent clinical outcome in people with a particular disease or health condition.
Examples:
Biological (biomarkers)
– genomic
– proteomic
– imaging
– physiological variables
Others
– psychosocial (e.g. depression)
– ecological (e.g.
area-level social deprivation)

[Figure slides: Hamilton et al, J Transl Med 2010]

Prognostic importance of a single specific prognostic factor/marker
A clear view of the benefit of a marker is only likely to emerge from looking across multiple studies
– Systematic review
We should by now know the prognostic importance of numerous markers that have been extensively investigated for many cancers and other diseases
– Why don't we?

Example: p53 as a prognostic marker in bladder cancer
Systematic review of the literature: 168 published studies, >10,000 patients.
"After 10 years of research, evidence is not sufficient to conclude whether changes in P53 act as markers of outcome in patients with bladder cancer." [Malats et al, Lancet Oncology 2005]

Example: Ki-67 in breast cancer
Systematic review: 43 studies, >15,000 patients. Some evidence of publication bias.
"Whether these proliferation markers provide additional prognostic information to commonly used prognostic indices remains unclear." [Stuart-Harris et al, Breast 2008]

Evidence from systematic reviews that the quality of prognostic factor research needs to improve
Coronary disease: "Multiple types of reporting bias, and publication bias, make the magnitude of any independent association between CRP and prognosis among patients with stable coronary disease sufficiently uncertain that no clinical practice recommendations can be made." [Hemingway et al, PLoS Med 2010]
Osteosarcoma: "93 papers were studied in depth … Only 7 papers were of sufficient quality to analyze ... Because of heterogeneity of the studies, pooling results is hardly possible." [Bramer et al, Eur J Surg Oncol 2009]
Peptic ulcer perforation: "Fifty prognostic studies with 37 prognostic factors comprising a total of 29,782 patients were included in the review. The overall methodological quality was acceptable, yet only two-thirds of the studies provided confounder adjusted estimate" [Moller et al, Scand J Gastroenterol]

Multiple studies
Clinical and methodological heterogeneity
– Different patient groups
– Different assays/measurement techniques
– Variation in cutpoints
– Adjustment for different other variables (or none)
… leading to
– confusion
– amplification of biases
Results are probably not reliable even if there is apparently a clear picture
– More studies may make things worse!
Publication bias
"… the literature is probably cluttered with false-positive studies that would not have been submitted or published if the results had come out differently." [Simon, 2001]
"Together with the long recognized problem of publication bias favoring studies that report positive findings, the result may be a body of literature that is heavily influenced by false-positive findings."

[Figure slides: Bcl-2, Martin et al, BJC 2003; Hemingway et al, PLoS Med 2010 – 83 studies of C-reactive protein in stable coronary artery disease]

Prognostic factor research: limitations
– Small samples
– Poor statistical analysis: adjustment for known predictors, handling of continuous variables
– Heterogeneous laboratory methods
– Lack of replication
– Poor publication practices: inadequate reporting, selective publication
Reliable answers require better studies – especially planned collaborative studies leading to IPD meta-analysis

Reporting guidelines
JNCI, BJC, JCO, EJC 2005

REMARK: REporting guidelines for tumor MARKer prognostic studies
Recommended reporting elements to facilitate
– Evaluation of appropriateness & quality of study design, methods, and analysis
– Understanding of the context in which conclusions apply
– Reproducibility
– Comparisons across studies, including formal meta-analyses

REMARK checklist elements (20 items in total)
Introduction: markers examined; study objectives
Methods: patients; specimen characteristics; assay methods; study design; statistical analysis methods
Results: data; analysis & presentation
Discussion: interpretation; implications

REMARK Item 17: "Among reported results, provide estimated effects with confidence intervals from an analysis in which the marker and standard prognostic variables are included, regardless of their statistical significance"
129 articles: 36% included the marker in a multivariable model with standard clinical variables [Vickers et al, Cancer 2008]

Reporting of prognostic studies – pre-REMARK
First 10 articles, 5 high-profile cancer journals, 2006-7 [Mallett et al, BJC 2010]

REMARK item                        Reported
Number of patients overall
  Assessed for eligibility         56%
  Excluded                         54%
Number available for analysis
  Patients                         98%
  Events                           50%
Number in univariable analysis
  Patients                         54%
  Events                           21%
Numbers in multivariable analysis
  Patients                         54%
  Events                           30%
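REMARK item 17 above asks for adjusted effect estimates with confidence intervals. As a hedged illustration of what such an analysis might look like, here is a sketch using the lifelines package; the dataset and all variable names are invented:

```python
# Hypothetical sketch of the analysis REMARK item 17 asks to see
# reported: the marker and standard prognostic variables together in
# one multivariable survival model, with effect estimates and
# confidence intervals. Dataset and column names are invented.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("cohort.csv")  # time, event, marker, age, stage, grade

cph = CoxPHFitter()
cph.fit(df[["time", "event", "marker", "age", "stage", "grade"]],
        duration_col="time", event_col="event")

# Report hazard ratios with 95% CIs for ALL variables, including the
# marker, regardless of statistical significance.
print(cph.summary[["exp(coef)",
                   "exp(coef) lower 95%",
                   "exp(coef) upper 95%"]])
```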
Prognostic model research
A prognostic model is a formal combination of multiple prognostic factors from which risks of a specific endpoint can be calculated for individual patients. Also called: prognostic (or prediction) index; prognostic (or prediction) rule; risk (or clinical) prediction model.

Uses
– Clinical practice: communication with patients/relatives; risk stratification
– Design and analysis of clinical trials
– Case-mix adjustment

Major steps:
– Development: identification and combination of variables associated with outcome
– External validation: evaluate the model's predictive ability in a different population
– Impact: evaluate the impact of the use of the prognostic model on health outcomes

Published prediction models
– 111 models for prostate cancer (Shariat 2008)
– 102 models for traumatic brain injury (Perel 2006)
– 83 models for stroke (Counsell 2001)
– 54 models for breast cancer (Altman 2009)
– 43 models for type 2 diabetes (Collins 2011; van Dieren 2012); 20+ more models have since been published!
– 31 models for osteoporotic fracture (Steurer 2011); omitted FRAX due to insufficient information
– 29 models in reproductive medicine (Leushuis 2009)
– 26 models for hospital readmission (Kansagara 2011)
– >25 models for length of stay after cardiac surgery (Ettema 2010)
– 13 models for tooth decay (Ritter 2010)
Very few of these models have been 'validated' in new data and compared.

Prediction models in UK clinical guidelines
– Framingham Risk Score & QRISK2 (NICE CG67): 10-year CVD risk
– Nottingham Prognostic Index (NICE CG80): recurrence & survival in breast cancer patients
– FRAX & QFracture (NICE CG146): 10-year osteoporotic and hip fracture risk
– GRACE/PURSUIT/PREDICT/TIMI (NICE CG94): adverse CV outcomes in patients with UA/NSTEMI
– APGAR (NICE CG132/2): newborn prognosis
– SAPS & APACHE (NICE CG50): ICU scoring systems
– Leicester Diabetes Risk Score, QDSCORE, Cambridge Risk Score (NICE PH38): type 2 diabetes

Model development
– Select important candidate predictors, trying to avoid selection based on statistically significant univariable associations with outcome
– Appropriately handle (acknowledge) missing data
– Fit a multivariable model
– Estimate the predictive performance: calibration and discrimination
– Quantify any optimism from overfitting: use bootstrapping (avoid randomly splitting a dataset); see the sketch after this list
Prediction models should be presented in adequate detail to allow predictions in individuals, either for subsequent validation studies or in clinical practice.
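A minimal sketch of bootstrap optimism correction for a logistic model's apparent c-index (Python/scikit-learn; the data file and predictor names are hypothetical):

```python
# Minimal sketch of bootstrap optimism correction (Harrell-style) for
# the apparent c-index of a logistic regression model. The data file
# and predictor names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("development.csv")
X, y = df[["age", "marker", "stage"]].to_numpy(), df["event"].to_numpy()

model = LogisticRegression().fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

rng = np.random.default_rng(1)
optimism = []
for _ in range(200):                       # bootstrap replicates
    idx = rng.integers(0, len(y), len(y))  # resample with replacement
    m = LogisticRegression().fit(X[idx], y[idx])
    boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    test = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(boot - test)           # how much the model flatters itself

corrected = apparent - np.mean(optimism)
print(f"apparent c-index {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

Unlike a random split, this uses all of the data for both fitting and assessment, which is why bootstrapping is preferred for internal validation of small datasets.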
Why do we need to validate a model?
– Deficiencies in the design of prognostic studies
– Deficiencies of standard modelling methods
– Models may not be transportable: over-optimism because of data-dependent analysis choices; variation in 'case-mix'
– Performance cannot be predicted: we need empirical demonstration of model performance
Usefulness is determined by how well a model works in practice, not by P values. An important feature of validation is to provide an unbiased estimate of prediction error.

Poor reporting … and poor conduct
Reviews of published studies:
– Diabetes (Collins et al, BMC Med 2011)
– Cancer (Mallett et al, BMC Med 2010)
– Kidney disease (Collins et al, J Clin Epidemiol 2012)
– General medical journals (Bouwmeester et al, PLoS Med 2012)
– Breast cancer (Altman, Cancer Invest 2009)
– Missing data in prognosis studies (Burton, Br J Cancer 2004)
and many more….

Conclusions from the systematic reviews
Poor reporting
– Number of events often difficult to identify; candidate predictors (and their number) inadequately defined
– Insufficient information to determine events per variable (EPV): 40% of studies (Mallett 2010); 33% (Collins 2011)
– How candidate predictors were selected: unclear in 25% of studies (Bouwmeester 2012)
– How the multivariable model was derived: unclear in 77% of studies (Mallett 2010)
– Missing data rarely mentioned (41% Collins 2010; 45% Collins 2012); missing data is often an exclusion criterion (but often not specified); complete-case analysis usually carried out
– Model often not reported in full: intercept missing for logistic regression, baseline survival missing for Cox regression models
– Ranges of continuous predictors rarely reported

Methodological shortcomings, including
– Small sample size (number of events) [EPV < 10; e.g., 50 events with 20 candidate predictors gives an EPV of only 2.5]
– Large number of candidate predictors
– Calibration rarely assessed: 74% not done (Collins); 46% not done (Bouwmeester)
– Dichotomization of all/some continuous predictors: 63% of studies (Collins); 70% of studies (Mallett)
– Previously published models often ignored
– Inadequate validation: reliance on a random split (often of an already small dataset) to validate
Lack of comparisons of competing models on the same dataset – Siontis et al, BMJ (2012); Collins & Moons, BMJ (2012)

External validation
Separate dataset (not a random split)
– Different centres (geographical validation)
– Different time period (temporal validation)
– Different case-mix
– Possibly with different definitions of predictors and outcome
Ideally conducted by independent researchers

Evaluating model performance
Performance of prediction models is characterised by
– Calibration: agreement between observed outcomes and predictions (often ignored; preferably assessed graphically)
– Discrimination: ability to distinguish between patients who do and do not experience the event of interest (usually reported: c-index)

[Figure: calibration plot for a scoring system for predicting postoperative nausea and vomiting (PONV), van den Bosch et al 2005]

Review of published validation studies [Collins et al, 2014]
Reviewed 78 articles that evaluated 120 prediction models in participant data that were not used to develop the model:
– 16% did not report the number of outcome events in the validation dataset
– 54% made no explicit mention of missing data
– 67% did not report evaluating model calibration

Transparent Reporting of multivariable models for Individual Prognosis Or Diagnosis (TRIPOD)
Consensus-based guidelines for improving the quality of reporting of multivariable prediction modelling studies. Focus on reporting (but much attention to methodological conduct in the long E&E paper).
Steering group: Gary Collins (Oxford), Karel Moons (UMC Utrecht), Doug Altman (Oxford), Hans Reitsma (UMC Utrecht)

TRIPOD checklist elements (22 items in total)
Title and abstract
Introduction: background & objectives
Methods: source of data; participants; outcome; predictors; sample size; missing data; statistical analysis methods; risk groups; development versus validation
Results: participants; model development; model presentation; model performance; model updating
Discussion: limitations; interpretation; implications
Other information

TRIPOD
Key minimal information deemed important to report. Helps authors, peer reviewers, editors, readers and potential users.
Educational – providing guidance, cautioning against particular approaches. Improves evaluation of risk of bias (PROBAST) if more information is reported. Submitted for publication in March 2014.
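TRIPOD's model-presentation items speak directly to the "model not reported in full" problem noted above: a model is fully reported only when a reader can compute a risk for a new individual. A hypothetical illustration (all coefficients invented):

```python
# Why full model reporting matters: with the intercept and all
# coefficients published, anyone can compute an individual's predicted
# risk from a logistic regression model. All numbers here are invented
# for illustration.
import math

INTERCEPT = -5.2                     # the piece most often left out
COEFS = {"age": 0.04, "marker": 0.80, "stage": 0.60}

def predicted_risk(covariates: dict) -> float:
    """Predicted probability from a fully reported logistic model."""
    lp = INTERCEPT + sum(COEFS[k] * v for k, v in covariates.items())
    return 1.0 / (1.0 + math.exp(-lp))

print(predicted_risk({"age": 60, "marker": 1.5, "stage": 2}))  # ~0.40
# Without the intercept (or, for a Cox model, the baseline survival),
# this calculation is impossible and the model cannot be validated.
```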
Published prognostic studies
– Poor methods are widely used; exploratory studies presented as if confirmatory
– We need high-quality reporting so we can identify and discard bad studies: REMARK, TRIPOD, …
Other initiatives…

"Across many types of research, accumulating evidence of bias has led to increasing support for greater transparency, especially relating to registration, publication of full protocols, and adherence to reporting guidelines. None of these will solve all the problems, but certainly all will help."

Assessing risk of bias (QUIPS tool)
QUIPS: 6 bias domains
1. Participation
2. Attrition
3. Prognostic factor measurement
4. Confounding measurement and account
5. Outcome measurement
6. Analysis and reporting

Phases in prediction model research
Development
– Predictor selection, model building
– Internal validation (evaluating optimism): split sample (random) – inefficient, not very useful; cross-validation; bootstrapping
Evaluate performance (external validation)
– Temporal & geographical validation
– Independent validation (i.e. independent investigators)
Impact study
– Does the prognostic model improve patient outcomes?
– Does the prognostic model change clinician behaviour?
– Is the prognostic model cost-effective?

Assessing performance: comparing observations with predictions
Comparison of observed and predicted event rates for groups of patients (calibration)
– Can plot observed proportions of events against predicted probabilities
– Ideally, observed and predicted proportions agree over the whole range of probabilities, and the plot shows a 45° line
– Can fit the model: observed mortality = a + b × risk score
Measures that distinguish between patients who do or do not experience the event of interest (discrimination)
– Discrimination is often assessed in a graph or in a table; a sketch combining both checks follows this list.
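A minimal sketch of both checks applied to a validation dataset (Python/scikit-learn; the file and column names are hypothetical):

```python
# Minimal sketch of assessing calibration and discrimination for a
# model's predicted risks on an external validation dataset. File and
# column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

df = pd.read_csv("validation.csv")   # observed outcome + predicted risk
y, p = df["event"], df["predicted_risk"]

# Discrimination: c-index (equals the ROC AUC for a binary outcome)
print(f"c-index: {roc_auc_score(y, p):.3f}")

# Calibration: observed event proportion per decile of predicted risk
obs, pred = calibration_curve(y, p, n_bins=10, strategy="quantile")
plt.plot(pred, obs, "o-", label="model")
plt.plot([0, 1], [0, 1], "--", label="ideal (45° line)")
plt.xlabel("Predicted probability")
plt.ylabel("Observed proportion")
plt.legend()
plt.show()
```

Points lying close to the 45° line indicate good calibration; a high c-index indicates good discrimination. As the slides stress, both are needed, and calibration is the one most often left unreported.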