Applying advanced computational approaches for data analytics of electronic health records
Nicholas Luscombe (Crick) and Harry Hemingway (UCL)
Apply to: UCL
Summary: Clinical cardiovascular risk prediction tools such as Framingham and QRISK2 assist in the clinical management of patients by aggregating a range of measurable risk factors (e.g., smoking, blood pressure, BMI), medical history and treatment information to predict disease risk and guide subsequent therapeutic management. However, conventional epidemiological approaches are limited to predictors defined a priori; further, models developed using regression-based methods predict one outcome at a time (e.g., all-cause mortality, coronary death, non-fatal myocardial infarction) and cannot easily incorporate new data types (e.g., genome sequences or novel risk factors from emerging data sources). Curated resources like CALIBER (CArdiovascular disease research using Linked BEspoke studies and Electronic Health Records, by Hemingway and Denaxas) provide access to contemporary linked national electronic health records from primary to secondary care, disease registries and mortality registers. These datasets are large, covering many patients and measurement types, but generally sparse; they therefore present new opportunities for applying probabilistic modelling and unsupervised machine learning methods as exploratory complements to classical models. Machine learning can enable automatic identification of new candidate prognostic factors that could be trialled as predictors in clinical care. Conditional graphical models or structured output learning can then reveal conditional dependencies between variables that are themselves functions of independent variables (the predictors) and that can be updated in real time. This allows users to compare the probabilities of multiple clinical outcomes and to perform more complex, conditional queries (e.g., what is the probability of a set of outcomes given an intervention and/or further observations?).
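To make the idea of a conditional query concrete, the following is a minimal sketch in Python, not the project's method. It assumes a toy model with four binary variables (smoker, treated, myocardial infarction, heart failure), a factorisation chosen purely for illustration, and probabilities that are invented rather than taken from CALIBER or any other dataset. Queries are answered by brute-force enumeration, which is only feasible for very small models. Treatment is handled here by plain conditioning; this matches an intervention only because the toy model gives treatment no parent variables.

```python
from itertools import product

# Hypothetical toy numbers, invented for illustration only (not from any dataset).
P_SMOKER = 0.2   # P(smoker = 1)
P_TREATED = 0.5  # P(treated = 1)
# P(outcome = 1 | smoker, treated), keyed by (smoker, treated)
P_MI = {(0, 0): 0.05, (0, 1): 0.03, (1, 0): 0.15, (1, 1): 0.09}
P_HF = {(0, 0): 0.04, (0, 1): 0.03, (1, 0): 0.10, (1, 1): 0.07}

VARIABLES = ("smoker", "treated", "mi", "hf")


def bernoulli(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(smoker, treated, mi, hf):
    """Joint probability under the assumed factorisation
    P(S) * P(T) * P(MI | S, T) * P(HF | S, T)."""
    return (bernoulli(P_SMOKER, smoker)
            * bernoulli(P_TREATED, treated)
            * bernoulli(P_MI[(smoker, treated)], mi)
            * bernoulli(P_HF[(smoker, treated)], hf))


def query(targets, evidence):
    """Return P(targets | evidence) by enumerating every joint state."""
    numerator = 0.0
    denominator = 0.0
    for values in product((0, 1), repeat=len(VARIABLES)):
        state = dict(zip(VARIABLES, values))
        # Skip states that contradict the evidence.
        if any(state[name] != value for name, value in evidence.items()):
            continue
        p = joint(**state)
        denominator += p
        if all(state[name] == value for name, value in targets.items()):
            numerator += p
    return numerator / denominator


# Probability of both outcomes for a treated versus an untreated smoker.
p_treated = query({"mi": 1, "hf": 1}, {"smoker": 1, "treated": 1})
p_untreated = query({"mi": 1, "hf": 1}, {"smoker": 1, "treated": 0})
# A query conditioned on an observed outcome rather than on a risk factor.
p_hf_given_mi = query({"hf": 1}, {"mi": 1})
p_hf = query({"hf": 1}, {})
```

In this toy model `p_hf_given_mi` exceeds `p_hf`: the two outcomes share risk factors, so observing one raises the probability of the other. A set of separately fitted single-outcome models has no direct way to express that kind of dependence between outcomes.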
Because many cardiovascular risk factors are well characterised and classical models generally perform well (C-index ~0.8), cardiovascular diseases provide an ideal context for testing new modelling approaches and for developing state-of-the-art analyses and models to predict heart failure. Early tests of the most basic machine-learning methods have shown that we can achieve prediction levels of at least 80% (similar to traditional tests). We expect to improve these predictions with more advanced methods and, moreover, to generalise these applications to include data from the UK Biobank and Genomics England and to cover additional clinical areas.
References:
1. Narasimhan VM, Hunt KA, Mason D … Hemingway H … van Heel DA. Health and population effects of rare gene knockouts in adult humans with related parents. Science. 2016 (in press).
2. Mifsud B, Tavares-Cadete F, Young AN, Sugar R, Schoenfelder S, Ferreira L, Wingett SW, Andrews S, Grey W, Ewels PA, Herman B, Happe S, Higgs A, LeProust E, Follows GA, Fraser P, Luscombe NM, Osborne CS. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat Genet. 2015 Jun;47(6):598-606.
3. Sugimoto Y, Vigilante A, Darbo E, Zirra A, Militti C, D'Ambrogio A, Luscombe NM, Ule J. hiCLIP reveals the in vivo atlas of mRNA secondary structures recognized by Staufen 1. Nature. 2015 Mar 26;519(7544):491-4.
4. Ilsley GR, Fisher J, Apweiler R, De Pace AH, Luscombe NM. Cellular resolution models for even skipped regulation in the entire Drosophila embryo. eLife. 2013 Aug 6;2:e00522.
5. Rapsomaniki E, Timmis A, George J, Pujades-Rodriguez M, Shah AD, Denaxas S, White IR, Caulfield MJ, Deanfield JE, Smeeth L, Williams B, Hingorani A, Hemingway H. Blood pressure and incidence of twelve cardiovascular diseases: lifetime risks, healthy life-years lost, and age-specific associations in 1·25 million people. Lancet. 2014;383(9932):1899-911.