Big Data in the UK Biobank: Opportunities and Challenges

Funders: Wellcome Trust and Medical Research Council,
with Department of Health, Scottish & Welsh Governments,
British Heart Foundation and Diabetes UK
Rory Collins
UK Biobank Principal Investigator
BHF Professor of Medicine & Epidemiology
Nuffield Department of Population Health
University of Oxford, UK
UK Biobank Prospective Cohort
• 500,000 UK men and women aged 40-69 years
when recruited and assessed during 2006-2010
• Extensive baseline questions and measurements,
with stored biological samples (and opportunities
to add enhanced assessments in large subsets)
• Repeat assessments over time in subsets of the
participants to allow for sources of variation
• General consent for follow-up through all health
records and for all types of health research
• Sufficiently large numbers of people developing
different conditions to assess causes reliably
Need for prospective studies to be LARGE:
CHD versus SBP for 5K vs 50K vs 500K people
in the Prospective Studies Collaboration (PSC)
[Figure: three panels (5000 people, 50,000 people, 500,000 people), each plotting CHD against usual SBP (120-180 mmHg) on a doubling vertical scale from 1 to 256, with separate lines for age at risk 40-49, 50-59, 60-69, 70-79 and 80-89]
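One way to see why size matters: the uncertainty in an estimated rate ratio shrinks roughly with the square root of the number of disease events. The sketch below is illustrative only. It assumes a hypothetical 2% of participants develop CHD during follow-up, split evenly between two SBP groups being compared; neither figure comes from the PSC.

```python
import math

# Hypothetical share of participants who develop CHD during follow-up
# (an assumption for illustration, not a PSC figure).
ASSUMED_EVENT_FRACTION = 0.02


def log_rate_ratio_se(events_a: int, events_b: int) -> float:
    """Approximate standard error of a log rate ratio from two event counts."""
    return math.sqrt(1 / events_a + 1 / events_b)


def ci_width_factor(cohort_size: int) -> float:
    """Ratio of the upper to the lower 95% confidence limit for the rate ratio."""
    events = round(cohort_size * ASSUMED_EVENT_FRACTION)
    half = events // 2  # assume events split evenly between the two groups
    se = log_rate_ratio_se(half, events - half)
    return math.exp(2 * 1.96 * se)


for n in (5_000, 50_000, 500_000):
    print(f"{n:>7} people: upper/lower 95% limit ratio ~{ci_width_factor(n):.2f}")
```

Each tenfold increase in cohort size narrows the standard error by a factor of about 3.2 (the square root of 10). Splitting the events across age groups, as in the figure, spreads them more thinly still.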
Locations of UK Biobank assessment centres around the UK
(with people recruited from urban and rural areas)
UK Biobank: 500,000 participants
aged 40-69 recruited in 2007-10
Age                  Gender               Deprivation
40-49   119,000      Male     228,000     More      92,000
50-59   168,000      Female   270,000     Average  166,000
60-69   213,000                           Less     241,000
Generalisability (not representativeness): Heterogeneity of study
population allows associations with disease to be studied reliably
Production line baseline assessment visit
(improved throughput; efficient staffing)
Baseline assessment: Questionnaire content
Self-completion: topics                          Median time (minutes)
Socio-demographics                                 1.7
Ethnicity                                          0.1
Work-employment                                    1.4
Physical activity                                  4.4
Smoking (non-smokers)                              0.5
Smoking (past/current smokers)                     1.5
Diet (food frequency)*                             4.5
Alcohol                                            1.1
Sleep                                              1.2
Sun exposure                                       1.3
Environmental exposures                            1.0
Early life factors                                 0.8
Family history of common diseases                  1.6
Reproductive history & screening (women)           2.4
Reproductive history & screening (men)             0.8
Sexual history                                     0.4
General health                                     2.1
Past medical history & medications                 1.6
Noise exposure                                     1.0
Psychological status                               4.5
Cognitive function tests                          10.0
Hearing speech-in-noise test                       8.0
Total time                                        52.5

Interview: topics                                Median time (minutes)
Medical history/medication                         3.1
Occupation                                         0.4
Other                                              0.6
Total time                                         4.1
*Subset of 200,000 participants: repeated
daily diet diaries conducted via the internet
Touchscreen and interview questions
(plus extra enhancement questions)
available at www.ukbiobank.ac.uk
Baseline assessment: Physical measurements
(with enhanced measures in large subsets)
All 500,000 participants
• Blood pressure & heart rate
• Height (standing/seated)
• Waist/hip circumference
• Weight/impedance
• Spirometry
• Heel ultrasound
Subset: 175,000 participants
• Hearing test
• Vascular reactivity
Subset: 120,000 participants
• Visual acuity, refractive index
& intraocular pressure
Subset: 85,000 participants
• Retinal images & optical
coherence tomograms
• Fitness test & ECG limb leads
UK Biobank different types of biological sample:
allowing a wide range of different assays
Sample collection tube          Fractions collected                Potential assays
EDTA                            • Plasma                           • Plasma proteome and metabonome
                                • Buffy coat                       • Assays of genomic DNA
                                • Red cells                        • Membrane lipids and heavy metals
Lithium Heparin (PST)           • Plasma                           • Plasma proteome and metabonome (without haemolysis)
Silica clot accelerator (SST)   • Serum                            • Serum proteome and metabonome (without haemolysis)
Acid citrate dextrose           • Whole blood                      • Assays of DNA extracted from EBV immortalised cell lines
                                                                   • (B-cell transcriptome)
EDTA                            • Whole blood                      • Standard haematological parameters
Tempus RNA stabilisation        • Whole blood with lysis reagent   • Blood transcriptome
                                                                   • Representative transcriptomes of other tissues
Urine                           • Urine                            • Urine proteome and metabonome
                                                                   • Gut microbiome
Saliva                          • Mixed saliva sample              • Salivary proteome and metabonome
                                                                   • Salivary microbiome
                                                                   • (Mucosal proteome and metabonome)
Further enhancements of the phenotyping of UK
Biobank participants currently being conducted
• Web-based assessments of diet completed
Web-based dietary assessment: 24-hr recall
• Design considerations:
– Easy and quick: takes only 10-15 minutes
– Automated data collection and coding
– Repeatable (capturing seasonal variation)
– Detailed enough to estimate nutrient intake
• Over 200,000 participants completed the questionnaire
at least once, and about 90,000 did so more than once
Future web-based assessments for exposures
• Cognitive function
– Repeat assessment of baseline measures
– Broaden cognitive phenotyping with new measures
– Complements enhanced cognitive function assessment
that is planned for the imaging assessment visit
• Occupational history
– Information about all previous occupations (not just latest)
– Greater detail on type of work and duration
• Physical activity questionnaire (RPAQ)
– Complement data from activity monitor
Further enhancements of the phenotyping of UK
Biobank participants currently being conducted
• Web-based assessments of diet completed;
and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all
participants who agree to wear one (2013-15)
UK Biobank wrist-worn accelerometer
• ~45% of participants agree to wear one
• Willing participants sent device by mail
• It is to be worn continuously for 7 days
• Returned by mail and data downloaded
• Device cleaned and sent to next participant
• 100K participants from mid-2013 to mid-2015
(50,000 complete data-sets already obtained)
Further enhancements of the phenotyping of UK
Biobank participants currently being conducted
• Web-based assessments of diet completed;
and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all
participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate
SNPs; exome) all participants (2013-15)
Genotyping of all UK Biobank participants
• 820K bespoke UK Biobank Affymetrix genotyping chip:
– 250,000 SNPs in a whole-genome array
– 200,000 markers for known risk factor or disease associations,
copy number variation, loss of function, and insertions/deletions
– 150,000 exome markers for high proportion of non-synonymous
coding variants with allele frequency over 0.02%
• Estimate (“impute”) additional genotypes by combining
measured genotypes with reference sequence data
• Researchers can study associations of genotype data
with biochemical risk factors and detailed phenotyping
from baseline assessment, along with health outcomes
Further enhancements of the phenotyping of UK
Biobank participants currently being conducted
• Web-based assessments of diet completed;
and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all
participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate
SNPs; exome) all participants (2013-15)
• Standard panel of assays (e.g. lipids; clotting)
on samples from all participants (2014-15)
Rationale for assaying many standard markers in
baseline samples from all 500,000 participants
• Cost-effective way of increasing the usability of the
resource for researchers, by providing data for:
– Cross-sectional analyses with prevalent disease
– Identification of subsets based on assay values
• Conducting these assays in all of the participants at
the same time should facilitate good quality control
• Lower cost for conducting all of these assays at one
time rather than in multiple retrievals and assays
• Facilitates management of depletable samples
Consideration of a proposal to conduct assays of
biomarkers of infectious disease in all participants
• Request from the international research community to
facilitate studies of the associations of infectious agents
with disease (in particular, different types of cancer)
• Plan would be to assay a panel of infectious agents
(e.g. HPV, Hepatitis B & C, HBV, EBV, H. pylori) in the
baseline sample collected from all 500,000 participants
• As with the biochemical and genetic assays that are
being conducted, assays of a wide range of infectious
agents would increase the efficient use of the resource
• Detailed proposal for funding is now being developed
Further enhancements of the phenotyping of UK
Biobank participants currently being conducted
• Web-based assessments of diet completed;
and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all
participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate
SNPs; exome) all participants (2013-15)
• Standard panel of assays (e.g. lipids; clotting)
on samples from all participants (2014-15)
• Information from multiple imaging modalities
(e.g. brain/heart/body MRI; bone/joint DEXA)
Imaging of 100,000 UK Biobank participants
• MRI of brain, heart and abdomen
• DEXA of bones, joints and body
• Ultrasound of carotid arteries
• Shortened baseline assessment plus
more detailed cognitive function tests
and ECG to detect rhythm disturbances
Pilot phase: 4,000-6,000 people in 1 centre (2014-15)
Main phase: 95,000 people in 3 centres (2015-19)
Opportunities for repeat imaging in sub-sets
(e.g. as part of MRC’s focus on dementia)
Body Mass Index (BMI) vs Heart Disease and Stroke
(PSC:1M people followed for 12 years; Lancet 2009)
[Figure: annual deaths per 1000 (floated so mean = PSC rates at age 65-69; doubling scale from 10 to 160) against baseline BMI (15 to 50 kg/m²), with separate curves for heart disease (18 237 deaths) and stroke (6122 deaths)]
• At BMI >25: 5 units higher BMI associated with ~40% higher IHD & stroke mortality
• At BMI <25: positive association continues for IHD, but not for stroke
Adjusted for age, sex, smoking & study; first 5 years of follow-up excluded
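The "~40% higher mortality per 5 units" figure above BMI 25 can be rescaled to other BMI differences if the association is taken to be log-linear over that range. That is an assumption made here for illustration, and the function name is hypothetical. A minimal sketch:

```python
import math

# From the slide: ~40% higher IHD & stroke mortality per 5 kg/m2 higher BMI (above BMI 25).
RATE_RATIO_PER_5_UNITS = 1.40


def mortality_rate_ratio(bmi_difference: float,
                         ratio_per_5: float = RATE_RATIO_PER_5_UNITS) -> float:
    """Rate ratio for a given BMI difference, assuming a log-linear association."""
    return math.exp(math.log(ratio_per_5) * bmi_difference / 5)


for diff in (1, 5, 10):
    print(f"{diff:>2} kg/m2 higher BMI -> rate ratio ~{mortality_rate_ratio(diff):.2f}")
```

Under this assumption, a 10-unit difference roughly doubles mortality (1.4 × 1.4 = 1.96), while a single unit corresponds to about 7% higher mortality.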
Similar age, gender, BMI & % body fat,
but different amounts of INTERNAL FAT
5.86 litres of internal fat
1.65 litres of internal fat
Atrial fibrillation (AF): prevalence and mortality
during the period between 1993 and 2007
Prevalence: increasing
Mortality: little change
Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012
Consideration of prolonged cardiac monitoring
• Cardiac arrhythmias (especially AF)
– can indicate significant underlying cardiac disease
– can directly cause significant morbidity and mortality
– important risk factors for cardio-embolic events (esp. stroke)
• Detection requires prolonged monitoring
– many are intermittent (e.g. paroxysmal AF)
– substantial under-detection with standard 12 lead ECG
– AF increases with age (<50 years: <1%; >80 years: 10%+)
• No large-scale population-based prospective studies with
prolonged monitoring, so the full extent/impact of AF on
health outcomes is likely to have been underestimated
Example of device for prolonged arrhythmia detection
iRhythm Zio Patch
• Has been used in 18,000 people
• Non-invasive stick-on patch
• Comfortable (median wear 12 days)
• Can be applied in clinic or at home
• Beat-to-beat ECG recording
• Validated against reference Holter
• Potentially recyclable device chip
which stores data for downloading
Planning to pilot feasibility and
acceptability during imaging pilot
UK Biobank: Centralised follow-up of health
• Death and cancer registries
• In-patient and out-patient hospital episodes (including
psychiatric) and related procedure registries
• Primary care records of health conditions, prescriptions,
diagnostic tests and other investigations
• Other health-related: disease registries; dispensing
records; imaging; screening; dental records
• Direct to participants: self-reported medical conditions;
treatments actually being taken; degree of functional
impairment; cognitive and psychological scores
Health outcome data-linkage challenges
• Regulation, bureaucracy, and permissions
(despite explicit consent from participants)
• Data transfer, matching and coding queries
• Understanding different data structures
• Mapping between coding systems
• Mapping between different countries
• Presenting outcome data to researchers
– Original outcome codes
– Post-adjudication outcomes
Progress with UK-wide linkage to outcome data
(both before and after baseline assessment)
Meaning of coded data from health records
• What do the coded data actually tell us?
• Characteristics of coded data
– How accurate?
– How detailed?
– How complete?
• Do we need to go beyond the coded data?
UK Biobank: Expected numbers of participants
developing diseases during long-term follow-up
Condition              2012      2017      2022
Diabetes             10,000    25,000    40,000
MI/CHD death          7,000    17,000    28,000
Stroke                2,000     5,000     9,000
COPD                  3,000     8,000    14,000
Breast cancer         2,500     6,000    10,000
Colorectal cancer     1,500     3,500     7,000
Prostate cancer       1,500     3,500     7,000
Lung cancer             800     2,000     4,000
Hip fracture            800     2,500     6,000
Rh. arthritis           800     2,000     3,000
Alzheimer’s             800     3,000     9,000
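To put these expected numbers in proportion to the 500,000-strong cohort, the short sketch below converts three rows of the table into percentages. It assumes the rows read as Diabetes 10,000 / 25,000 / 40,000, MI/CHD death 7,000 / 17,000 / 28,000 and Alzheimer's 800 / 3,000 / 9,000; the helper name is hypothetical.

```python
COHORT_SIZE = 500_000
YEARS = (2012, 2017, 2022)

# Expected numbers of participants developing each condition, for three table rows.
expected_cases = {
    "Diabetes": (10_000, 25_000, 40_000),
    "MI/CHD death": (7_000, 17_000, 28_000),
    "Alzheimer's": (800, 3_000, 9_000),
}


def percent_of_cohort(cases: int, cohort_size: int = COHORT_SIZE) -> float:
    """Expected cases as a percentage of the whole cohort."""
    return 100 * cases / cohort_size


for condition, counts in expected_cases.items():
    shares = ", ".join(
        f"{year}: {percent_of_cohort(count):.2f}%" for year, count in zip(YEARS, counts)
    )
    print(f"{condition:<13} {shares}")
```

Even the most common condition listed is expected to affect under a tenth of participants by 2022, which is why a cohort of this size is needed to accumulate thousands of cases of rarer diseases.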
General strategy for outcome adjudication
• Avoid false positive cases (but
tolerate some false negatives)
• Geographical generalisability
• Cost-effectiveness
• Future-proofed
• Scalability
• Staged approach:
– Ascertain
– Confirm
– Classify
Staged approach to outcome adjudication
APPROACH              CHARACTERISTICS           POSSIBLE DATA SOURCES

ASCERTAINMENT         Cost-effective            Death registers
of suspected cases    Feasible                  Cancer registers
                      Scalable                  Hospital episodes
                                                Primary care records
                                                Web-based questionnaires

CONFIRMATION          As above, but greater     Cross-referencing e-records
of “case-ness”        cost/lower feasibility    Disease registers

CLASSIFICATION        More involved and         Review of clinical records
of disease cases      costly per case           Tumour collections/assays
                                                Specialised databases
                                                (e.g. imaging)
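The three stages can be pictured as a simple pipeline. The sketch below is a hypothetical illustration, not UK Biobank's actual algorithm. The record format, the source names, and in particular the confirmation rule (a case must appear in at least two independent sources) are all assumptions, chosen to reflect the aim of avoiding false positives while tolerating some false negatives.

```python
from dataclasses import dataclass, field


@dataclass
class SuspectedCase:
    participant_id: str
    condition: str
    sources: set[str] = field(default_factory=set)


def ascertain(records):
    """Stage 1: group coded records into suspected cases per (participant, condition)."""
    cases = {}
    for participant_id, condition, source in records:
        key = (participant_id, condition)
        case = cases.setdefault(key, SuspectedCase(participant_id, condition))
        case.sources.add(source)
    return list(cases.values())


def confirm(cases, min_sources=2):
    """Stage 2: keep cases seen in at least min_sources sources (assumed rule)."""
    return [case for case in cases if len(case.sources) >= min_sources]


def classify(case, subtype_lookup):
    """Stage 3: attach a subtype from a hypothetical clinical-review lookup."""
    return subtype_lookup.get((case.participant_id, case.condition), "unclassified")


# Hypothetical records: (participant, condition, data source)
records = [
    ("P1", "stroke", "hospital_episodes"),
    ("P1", "stroke", "primary_care"),
    ("P2", "stroke", "web_questionnaire"),
]
confirmed = confirm(ascertain(records))
subtypes = {("P1", "stroke"): "ischaemic"}
for case in confirmed:
    print(case.participant_id, case.condition, classify(case, subtypes))
```

Here participant P2, reported by only one source, is dropped at the confirmation stage. That is an accepted false negative if the report was genuine, in exchange for fewer false positives.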
Expert Working Groups developing protocols for
ascertainment, confirmation and classification
• Cancer; Diabetes; Cardiac outcomes; Stroke:
Pilots progressing well; preparing for scaling up
of algorithms and then for web adjudication
• Mental health outcomes; Ocular outcomes; Neurodegenerative outcomes:
Pilots commencing
• Respiratory outcomes; Musculoskeletal outcomes:
Pilots being developed
UK Biobank: Principles of Access
• UK Biobank is available to all bona fide researchers for all
types of health-related research that is in the public interest
• No preferential or exclusive access (and, in particular,
access does not involve “collaboration” with UK Biobank)
• Researchers have to pay for access to the Resource for
their proposed research on a cost-recovery basis only
• Access to the biological samples that are limited and
depletable will be carefully controlled and coordinated
• Researchers are required to publish their findings and
return the data so that other researchers can use them
“Showcase”: e-catalogue of data items
currently in the UK Biobank Resource
(www.ukbiobank.ac.uk)
Showcase supports search strategies for
data items in the UK Biobank Resource
Body Composition: % Body Fat
Preliminary applications subdivided by type of
researcher, location and type of research
What makes UK Biobank special?
• PROSPECTIVE: It can assess the full effects of a particular
exposure (such as smoking) on all types of health outcome
(such as cancer, vascular disease, lung disease, dementia)
• DETAILED: The wide range of questions, measures and
samples at baseline allows good assessment of exposures,
and outcome adjudication allows good disease classification
• BIG: Inclusion of a large number of participants allows reliable
assessment of the causes of a wide range of diseases, and
of the combined impact of many different exposures
Unique combination of
BREADTH and DEPTH