Big Data in UK Biobank: Opportunities and Challenges

Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh Governments, British Heart Foundation and Diabetes UK

Rory Collins
UK Biobank Principal Investigator
BHF Professor of Medicine & Epidemiology
Nuffield Department of Population Health
University of Oxford, UK

UK Biobank Prospective Cohort
• 500,000 UK men and women aged 40-69 years when recruited and assessed during 2006-2010
• Extensive baseline questions and measurements, with stored biological samples (and opportunities to add enhanced assessments in large subsets)
• Repeat assessments over time in subsets of the participants to allow for sources of variation
• General consent for follow-up through all health records and for all types of health research
• Sufficiently large numbers of people developing different conditions to assess causes reliably

Need for prospective studies to be LARGE: CHD versus SBP for 5K vs 50K vs 500K people in the Prospective Studies Collaboration (PSC)
[Figure: three panels (5000 people, 50,000 people and 500,000 people), each plotting CHD on a doubling scale (1 to 256) against usual SBP (120-180 mmHg), with one line per age at risk: 40-49, 50-59, 60-69, 70-79 and 80-89]

Locations of UK Biobank assessment centres around the UK (with people recruited from urban and rural areas)

UK Biobank: 500,000 participants aged 40-69 recruited in 2007-10
• Age: 40-49: 119,000; 50-59: 168,000; 60-69: 213,000
• Gender: Male: 228,000; Female: 270,000
• Deprivation: More: 92,000; Average: 166,000; Less: 241,000
Generalisability (not representativeness): Heterogeneity of study population allows associations with disease to be studied reliably

Production line baseline assessment visit (improved throughput; efficient staffing)

Baseline assessment: Questionnaire content
Self-completion: topics Median time
(minutes)
Socio-demographics: 1.7
Ethnicity: 0.1
Work-employment: 1.4
Physical activity: 4.4
Smoking (non-smokers): 0.5
Smoking (past/current smokers): 1.5
Diet (food frequency)*: 4.5
Alcohol: 1.1
Sleep: 1.2
Sun exposure: 1.3
Environmental exposures: 1.0
Early life factors: 0.8
Family history of common diseases: 1.6
Reproductive history & screening (women): 2.4
Reproductive history & screening (men): 0.8
Sexual history: 0.4
General health: 2.1
Past medical history & medications: 1.6
Noise exposure: 1.0
Psychological status: 4.5
Cognitive function tests: 10.0
Hearing speech-in-noise test: 8.0
Total time: 52.5

Interview: topics Median time (minutes)
Medical history/medication: 3.1
Occupation: 0.4
Other: 0.6
Total time: 4.1

*Subset of 200,000 participants: repeated daily diet diaries conducted via the internet
Touchscreen and interview questions (plus extra enhancement questions) available at www.ukbiobank.ac.uk

Baseline assessment: Physical measurements (with enhanced measures in large subsets)
All 500,000 participants
• Blood pressure & heart rate
• Height (standing/seated)
• Waist/hip circumference
• Weight/impedance
• Spirometry
• Heel ultrasound
Subset: 175,000 participants
• Hearing test
• Vascular reactivity
Subset: 120,000 participants
• Visual acuity, refractive index & intraocular pressure
Subset: 85,000 participants
• Retinal images & optical coherence tomograms
• Fitness test & ECG limb leads

UK Biobank different types of biological sample: allowing a wide range of different assays
Listed by sample collection tube, with fractions collected and potential assays
• Na+ EDTA
  – Fractions collected: Plasma; Buffy coat; Red cells
  – Potential assays: Plasma proteome and metabonome; Assays of genomic DNA; Membrane lipids and heavy metals
• Lithium Heparin (PST)
  – Fractions collected: Plasma
  – Potential assays: Plasma proteome and metabonome (without haemolysis)
• Silica clot accelerator (SST)
  – Fractions collected: Serum
  – Potential assays: Serum proteome and metabonome (without haemolysis)
• Acid citrate dextrose
  – Fractions collected: Whole blood
  – Potential assays: Assays of DNA extracted from EBV immortalised cell lines; (B-cell transcriptome)
• EDTA
  – Fractions collected: Whole blood
  – Potential assays: Standard haematological parameters
• Tempus RNA stabilisation
  – Fractions collected: Whole blood with lysis reagent
  – Potential assays: Blood transcriptome; Representative transcriptomes of other tissues
• Urine
  – Fractions collected: Urine
  – Potential assays: Urine proteome and metabonome; Gut microbiome
• Saliva
  – Fractions collected: Mixed saliva sample
  – Potential assays: Salivary proteome and metabonome; Salivary microbiome; (Mucosal proteome and metabonome)

Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed

Web-based dietary assessment: 24-hr recall
• Design considerations:
  – Easy and quick: takes only 10-15 minutes
  – Automated data collection and coding
  – Repeatable (capturing seasonal variation)
  – Detailed enough to estimate nutrient intake
• Over 200,000 participants completed the questionnaire at least once, and about 90,000 did so more than once

Future web-based assessments for exposures
• Cognitive function
  – Repeat assessment of baseline measures
  – Broaden cognitive phenotyping with new measures
  – Complements enhanced cognitive function assessment that is planned for the imaging assessment visit
• Occupational history
  – Information about all previous occupations (not just latest)
  – Greater detail on type of work and duration
• Physical activity questionnaire (RPAQ)
  – Complement data from activity monitor

Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)

UK Biobank wrist-worn accelerometer
• ~45% of participants agree to wear one
• Willing participants sent device by mail
• It is to be worn continuously for 7 days
• Returned by mail and data downloaded
• Device cleaned and sent to next participant
• 100K participants from mid-2013 to mid-2015 (50,000 complete data-sets already obtained)

Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments
of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)

Genotyping of all UK Biobank participants
• 820K bespoke UK Biobank Affymetrix genotyping chip:
  – 250,000 SNPs in a whole-genome array
  – 200,000 markers for known risk factor or disease associations, copy number variation, loss of function, and insertions/deletions
  – 150,000 exome markers for high proportion of non-synonymous coding variants with allele frequency over 0.02%
• Estimate (“impute”) additional genotypes by combining measured genotypes with reference sequence data
• Researchers can study associations of genotype data with biochemical risk factors and detailed phenotyping from baseline assessment, along with health outcomes

Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)
• Standard panel of assays (e.g.
lipids; clotting) on samples from all participants (2014-15)

Rationale for assaying many standard markers in baseline samples from all 500,000 participants
• Cost-effective way of increasing the usability of the resource for researchers, by providing data for:
  – Cross-sectional analyses with prevalent disease
  – Identification of subsets based on assay values
• Conducting these assays in all of the participants at the same time should facilitate good quality control
• Lower cost for conducting all of these assays at one time rather than in multiple retrievals and assays
• Facilitates management of depletable samples

Consideration of a proposal to conduct assays of biomarkers of infectious disease in all participants
• Request from the international research community to facilitate studies of the associations of infectious agents with disease (in particular, different types of cancer)
• Plan would be to assay a panel of infectious agents (e.g. HPV, Hepatitis B & C, HBV, EBV, H. pylori) in the baseline sample collected from all 500,000 participants
• As with the biochemical and genetic assays that are being conducted, assays of a wide range of infectious agents would increase the efficient use of the resource
• Detailed proposal for funding is now being developed

Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)
• Standard panel of assays (e.g. lipids; clotting) on samples from all participants (2014-15)
• Information from multiple imaging modalities (e.g.
brain/heart/body MRI; bone/joint DEXA)

Imaging of 100,000 UK Biobank participants
• MRI of brain, heart and abdomen
• DEXA of bones, joints and body
• Ultrasound of carotid arteries
• Shortened baseline assessment plus more detailed cognitive function tests and ECG to detect rhythm disturbances
Pilot phase: 4-6,000 people in 1 centre (2014-15)
Main phase: 95,000 people in 3 centres (2015-19)
Opportunities for repeat imaging in sub-sets (e.g. as part of MRC’s focus on dementia)

Body Mass Index (BMI) vs Heart Disease and Stroke (PSC: 1M people followed for 12 years; Lancet 2009)
[Figure: annual deaths per 1000 (floated so mean = PSC rates at age 65-69) against baseline BMI (kg/m²), with separate curves for heart disease (18 237 deaths) and stroke (6122 deaths)]
• At BMI >25: 5 units higher BMI associated with ~40% higher IHD & stroke mortality
• At BMI <25: positive association continues for IHD, but not for stroke
Adjusted for age, sex, smoking & study; first 5 years of follow-up excluded

Similar age, gender, BMI & % body fat, but different amounts of INTERNAL FAT
[Images: 5.86 litres of internal fat vs 1.65 litres of internal fat]

Atrial fibrillation (AF): prevalence and mortality during the period between 1993 and 2007
• Prevalence: increasing
• Mortality: little change
Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012

Consideration of prolonged cardiac monitoring
• Cardiac arrhythmias (especially AF)
  – can indicate significant underlying cardiac disease
  – can directly cause significant morbidity and mortality
  – important risk factors for cardio-embolic events (esp. stroke)
• Detection requires prolonged monitoring
  – many are intermittent (e.g.
paroxysmal AF)
  – substantial under-detection with standard 12-lead ECG
  – AF increases with age (<50 years: <1%; >80 years: 10%+)
• No large-scale population-based prospective studies with prolonged monitoring, so the full extent/impact of AF on health outcomes is likely to have been underestimated

Example of device for prolonged arrhythmia detection
iRhythm Zio Patch
• Has been used in 18,000 people
• Non-invasive stick-on patch
• Comfortable (median wear 12 days)
• Can be applied in clinic or at home
• Beat-to-beat ECG recording
• Validated against reference Holter
• Potentially recyclable device chip which stores data for downloading
Planning to pilot feasibility and acceptability during imaging pilot

UK Biobank: Centralised follow-up of health
• Death and cancer registries
• In-patient and out-patient hospital episodes (including psychiatric) and related procedure registries
• Primary care records of health conditions, prescriptions, diagnostic tests and other investigations
• Other health-related: disease registries; dispensing records; imaging; screening; dental records
• Direct to participants: self-reported medical conditions; treatments actually being taken; degree of functional impairment; cognitive and psychological scores

Health outcome data-linkage challenges
• Regulation, bureaucracy, and permissions (despite explicit consent from participants)
• Data transfer, matching and coding queries
• Understanding different data structures
• Mapping between coding systems
• Mapping between different countries
• Presenting outcome data to researchers
  – Original outcome codes
  – Post-adjudication outcomes

Progress with UK-wide linkage to outcome data (both before and after baseline assessment)

Meaning of coded data from health records
• What do the coded data actually tell us?
• Characteristics of coded data
  – How accurate?
  – How detailed?
  – How complete?
• Do we need to go beyond the coded data?
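The "mapping between coding systems" challenge listed under data linkage might be approached along the lines of the sketch below. It is an illustration only: the lookup table, the record layout and the function name `harmonise` are assumptions made for this example, not UK Biobank's actual pipeline. The four entries use ICD-9 and ICD-10 categories for acute myocardial infarction and ischaemic stroke; a real mapping would be far larger and clinically reviewed. The original code is kept next to the harmonised label, so researchers could still see both the original outcome codes and the derived outcomes.

```python
from typing import NamedTuple, Optional

# Illustrative lookup only (assumption): (coding system, 3-character category) -> outcome label.
CODE_MAP = {
    ("ICD10", "I21"): "myocardial_infarction",
    ("ICD9", "410"): "myocardial_infarction",
    ("ICD10", "I63"): "ischaemic_stroke",
    ("ICD9", "434"): "ischaemic_stroke",
}


class Record(NamedTuple):
    """One coded event from a linked source (hypothetical layout)."""
    participant_id: int
    source: str
    system: str
    code: str


class Harmonised(NamedTuple):
    """The same event with a common outcome label and the original code retained."""
    participant_id: int
    source: str
    original_code: str
    outcome: Optional[str]


def harmonise(record: Record) -> Harmonised:
    # Match on the first three characters so that sub-codes such as "I21.9" map as well.
    category = record.code.replace(".", "")[:3]
    outcome = CODE_MAP.get((record.system, category))  # None when the code is not in the table
    return Harmonised(
        record.participant_id,
        record.source,
        f"{record.system}:{record.code}",
        outcome,
    )


records = [
    Record(1, "hospital_episodes", "ICD10", "I21.9"),
    Record(2, "death_register", "ICD9", "410"),
    Record(3, "hospital_episodes", "ICD10", "Z00.0"),
]
harmonised = [harmonise(r) for r in records]
for row in harmonised:
    print(row)
```

Codes that are not in the table come back with `outcome=None` rather than being dropped, which leaves room for later review of whether the coded data are accurate, detailed and complete enough.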
UK Biobank: Expected numbers of participants developing diseases during long-term follow-up
Condition: 2012; 2017; 2022
Diabetes: 10,000; 25,000; 40,000
MI/CHD death: 7,000; 17,000; 28,000
Stroke: 2,000; 5,000; 9,000
COPD: 3,000; 8,000; 14,000
Breast cancer: 2,500; 6,000; 10,000
Colorectal cancer: 1,500; 3,500; 7,000
Prostate cancer: 1,500; 3,500; 7,000
Lung cancer: 800; 2,000; 4,000
Hip fracture: 800; 2,500; 6,000
Rh. arthritis: 800; 2,000; 3,000
Alzheimer’s: 800; 3,000; 9,000

General strategy for outcome adjudication
• Avoid false positive cases (but tolerate some false negatives)
• Geographical generalisability
• Cost-effectiveness
• Future-proofed
• Scalability
• Staged approach:
  – Ascertain
  – Confirm
  – Classify

Staged approach to outcome adjudication
• ASCERTAINMENT of suspected cases
  – Characteristics: Cost-effective; Feasible; Scalable
  – Possible data sources: Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires
• CONFIRMATION of “case-ness”
  – Characteristics: As above, but greater cost/lower feasibility
  – Possible data sources: Cross-referencing e-records; Disease registers
• CLASSIFICATION of disease cases
  – Characteristics: More involved and costly per case
  – Possible data sources: Review of clinical records; Tumour collections/assays; Specialised databases (e.g.
imaging)

Expert Working Groups developing protocols for ascertainment, confirmation and classification
• Cancer; Diabetes; Cardiac outcomes; Stroke: Pilots progressing well; preparing for scaling up of algorithms and then for web adjudication
• Mental health outcomes; Ocular outcomes; Neurodegenerative outcomes: Pilots commencing
• Respiratory outcomes; Musculoskeletal outcomes: Pilots being developed

UK Biobank: Principles of Access
• UK Biobank is available to all bona fide researchers for all types of health-related research that is in the public interest
• No preferential or exclusive access (and, in particular, access does not involve “collaboration” with UK Biobank)
• Researchers have to pay for access to the Resource for their proposed research on a cost-recovery basis only
• Access to the biological samples that are limited and depletable will be carefully controlled and coordinated
• Researchers are required to publish their findings and return the data so that other researchers can use them

“Showcase”: e-catalogue of data items currently in the UK Biobank Resource (www.ukbiobank.ac.uk)

Showcase supports search strategies for data items in the UK Biobank Resource
Body Composition: % Body Fat

Preliminary applications subdivided by type of researcher, location and type of research

What makes UK Biobank special?
• PROSPECTIVE: It can assess the full effects of a particular exposure (such as smoking) on all types of health outcome (such as cancer, vascular disease, lung disease, dementia)
• DETAILED: The wide range of questions, measures and samples at baseline allows good assessment of exposures, and outcome adjudication allows good disease classification
• BIG: Inclusion of a large number of participants allows reliable assessment of the causes of a wide range of diseases, and of the combined impact of many different exposures
Unique combination of BREADTH and DEPTH
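
The "BIG" point can be made concrete with a rough calculation. The sketch below rests on simplifying assumptions that are not stated on the slides: under a rare-disease approximation, the standard error of a log relative risk comparing two exposure groups with similar numbers of cases is about 2 / sqrt(total cases), so a confidence interval narrows by a factor of about sqrt(10) ≈ 3.2 for each tenfold increase in cohort size at a fixed event rate. The event rate of 2% and the function name `ci_half_width` are illustrative choices, not UK Biobank figures.

```python
import math


def ci_half_width(n_people: int, event_rate: float = 0.02, k: float = 2.0) -> float:
    """Approximate 95% CI half-width on the log relative-risk scale.

    Assumes SE ~ k / sqrt(number of cases). k = 2 corresponds to two exposure
    groups with equal numbers of cases (rare-disease approximation).
    event_rate is an illustrative value, not taken from the slides.
    """
    cases = n_people * event_rate
    return 1.96 * k / math.sqrt(cases)


# Compare the three cohort sizes used in the PSC illustration earlier in the deck.
for n in (5_000, 50_000, 500_000):
    print(f"{n:>7,} people: +/- {ci_half_width(n):.3f} on the log scale")
```

Under these assumptions, going from 5,000 to 500,000 people narrows the interval tenfold, which is the kind of gain the 5K vs 50K vs 500K comparison is meant to show.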