Vanderbilt’s DNA Databank: BioVU
Personalized Medicine
• Integration of genomic information into clinical decision making
• Personalized disease treatment and also preventative therapies
What is BioVU?
• The move towards personalized medicine requires very large sample sets for discovery and validation
• BioVU: biobank intended to support a broad view of biology and enable personalized medicine
• Contains de-identified DNA extracted from leftover blood after clinically-indicated testing of Vanderbilt patients who have not opted out
• Linked to Synthetic Derivative: de-identified EMR
• Current sample number: 135,765
o 120,705 adult samples
o 15,099 pediatric samples
[Figure: de-identification of records and samples. Labels: Patient Communication Modules; “John Doe” → one way hash → A7CCF99DE65732….; eligible → extract DNA.]
The “synthetic derivative” (SD): can be updated
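The one-way hash step named on this slide can be illustrated with a short sketch. The slides do not say which hash function or keying scheme BioVU uses, so HMAC-SHA-512, the `research_id` helper, the key, and the sample record number below are all assumptions made for illustration.

```python
import hashlib
import hmac

def research_id(mrn: str, secret_key: bytes) -> str:
    """Map a medical record number to a de-identified research ID.

    Assumption: HMAC-SHA-512 stands in for the unspecified one-way hash.
    """
    digest = hmac.new(secret_key, mrn.encode("utf-8"), hashlib.sha512)
    return digest.hexdigest().upper()

# Hypothetical inputs, not real identifiers.
key = b"example-secret-key"
rid_emr = research_id("MRN-0001234", key)  # computed when the EMR record is scrubbed
rid_dna = research_id("MRN-0001234", key)  # computed when the DNA sample is labeled

# The same input always yields the same ID, so the record and the sample
# stay linked without either one carrying the original identifier.
print(rid_emr == rid_dna)
print(rid_emr[:16] + "....")
```

Because the mapping is deterministic, a later update to the same patient's record lands on the same research ID, which is consistent with the note that the synthetic derivative can be updated.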
The Synthetic Derivative
• A Derivative of the EMR - information content reduced by ‘scrubbing’ identifiers
• Systematically shifted event dates
• Contains ~1.9 million records
o ~1 million with detailed longitudinal data
o averaging 100,000 bytes in size
o an average of 27 codes per record
• Records updated over time and are current through 4/30/11
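Systematic date shifting can be sketched as follows. The slides do not describe the actual shifting method; this example assumes one fixed backward offset per record, derived from the record's hashed ID, so that intervals between events within a record are preserved. The helper names, the 364-day range, and the sample dates are hypothetical.

```python
import hashlib
from datetime import date, timedelta
from typing import List

def record_offset(record_id: str, max_days: int = 364) -> timedelta:
    """Derive a fixed backward shift of 1..max_days days from the record ID (assumed scheme)."""
    h = int(hashlib.sha256(record_id.encode("utf-8")).hexdigest(), 16)
    return timedelta(days=-(1 + h % max_days))

def shift_dates(record_id: str, events: List[date]) -> List[date]:
    """Apply the same offset to every event date in one record."""
    offset = record_offset(record_id)
    return [d + offset for d in events]

# Hypothetical event dates for one record.
events = [date(2010, 3, 1), date(2010, 3, 15), date(2011, 1, 10)]
shifted = shift_dates("A7CDE6532", events)

# Intervals between events are identical before and after shifting.
original_gaps = [(b - a).days for a, b in zip(events, events[1:])]
shifted_gaps = [(b - a).days for a, b in zip(shifted, shifted[1:])]
print(original_gaps == shifted_gaps)
```

Keeping one offset per record is what makes longitudinal analyses (time to event, order of diagnoses) still possible on the shifted data.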
Synthetic Derivative Data Types
• Narratives, such as:
o Clinical Notes
o Discharge Summaries
o History and Physicals
o Problem Lists
o Surgical Reports
o Progress Notes
o Letters
• Diagnostic Codes, Procedural Codes
• Forms (intake, assessment)
• Reports (pathology, ECGs, echocardiograms)
• Clinical Communications
• Lab Values and Vital Signs
• Medication Orders
• TraceMaster (ECGs)
Synthetic Derivative vs. BioVU
[Figure: the Synthetic Derivative (~1.9 million records) and BioVU (~135,000 samples), linked by the same hashed ID (A7CDE6532 ….).]
Sample accrual
[Figure: sample accrual over time; y-axis 0 to 225,000 samples. Series: Adult samples accrued, Pediatric samples accrued, Anticipated adult sample accrual, Anticipated pediatric samples.]
Current accrual as of 2-13-2012: 135,765 samples, 15,099 pediatric
BioVU Demographics
[Figure: BioVU samples by AGE (<1, 1 - 10, 11 - 20, 21 - 30, 31 - 40, 41 - 50, 51 - 60, 61 - 70, 71 - 75, >75), GENDER (Male, Female), and RACE (African American, Asian, Hispanic, Others, White).]
BioVU Sample Management
RTS SmaRTStore
Validation in BioVU
• Sample handling algorithms
o Gender match
o 1/384 gender mismatches
• Ancestry
o Characterize sample ancestry, assess usefulness of ‘race’ as defined in EMR
o Provide a panel of ancestry informative markers that define ancestry
o No significant difference between the concordance of self-report or observer-report with genetic ancestry
• Demonstration project – American Journal of Human Genetics, 2010
o Can known associations between genetic variants and common diseases be identified in the EMR?
The “demonstration project”
• Genotype “high-value” SNPs in the first 8,000 samples accrued.
o including SNPs associated by replicated genome-wide experiments with common diseases & traits
1. Atrial fibrillation
2. Crohn’s disease
3. Multiple Sclerosis
4. Rheumatoid arthritis
5. Type II Diabetes
• Develop Natural Language Processing methods to identify cases and controls
• Are genotype-phenotype relations replicated?
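First results
disease                 marker        gene / region
Atrial fibrillation     rs2200733     Chr. 4q25
Atrial fibrillation     rs10033464    Chr. 4q25
Crohn's disease         rs11805303    IL23R
Crohn's disease         rs17234657    Chr. 5
Crohn's disease         rs1000113     Chr. 5
Crohn's disease         rs17221417    NOD2
Crohn's disease         rs2542151     PTPN22
Multiple sclerosis      rs3135388     DRB1*1501
Multiple sclerosis      rs2104286     IL2RA
Multiple sclerosis      rs6897932     IL7RA
Rheumatoid arthritis    rs6457617     Chr. 6
Rheumatoid arthritis    rs6679677     RSBN1
Rheumatoid arthritis    rs2476601     PTPN22
Type 2 diabetes         rs4506565     TCF7L2
Type 2 diabetes         rs12255372    TCF7L2
Type 2 diabetes         rs12243326    TCF7L2
Type 2 diabetes         rs10811661    CDKN2B
Type 2 diabetes         rs8050136     FTO
Type 2 diabetes         rs5219        KCNJ11
Type 2 diabetes         rs5215        KCNJ11
Type 2 diabetes         rs4402960     IGF2BP2
[Figure: forest plot of the odds ratio for each marker; x-axis: Odds Ratio, ticks at 0.5, 1.0, 2.0, 5.0.]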
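The odds ratio plotted for each marker can be computed from a 2x2 table of allele counts in cases and controls. The sketch below is a generic illustration with hypothetical counts and a Woolf (log-based) 95% confidence interval; it is not the analysis code from the demonstration project, and the `odds_ratio` helper is an assumed name.

```python
import math
from typing import Tuple

def odds_ratio(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio with a Woolf confidence interval.

    a: risk alleles in cases      b: other alleles in cases
    c: risk alleles in controls   d: other alleles in controls
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Hypothetical allele counts for one marker.
estimate, lower, upper = odds_ratio(a=120, b=280, c=200, d=800)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")

# An interval that excludes 1.0 suggests the known association is replicated.
replicated = lower > 1.0 or upper < 1.0
print(replicated)
```

All four cells must be non-zero for this formula; sparse tables need a continuity correction or an exact method.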
Types of projects
• Discovery or validation of genotype-phenotype relations for disease susceptibility or drug responses
• Discovery of new disease/susceptibility genes → resequence in patients (obesity, Cushing's, susceptibility to infection, insomnia, pre-term birth)
• Access samples without disease X, or “normals” of specified ancestry, or old normals
• Phenome-wide association study (PheWAS): in development
Data Use Agreement
Genotyping Data Accrual
[Figure: Total Genotyped Subjects, N=56,859; series: Adult Samples, Ped Samples; y-axis 0 to 60,000.]
[Figure: Total GWAS Subjects, N=14,747; y-axis 0 to 16,000.]
Common Diagnoses in BioVU
Examples of ICD-9 codes for rare diseases
Example Rare Disease            Number in SD    Number in BioVU
Pica                            115             22
Septicemic Plague               21              0
Pick’s Disease                  45              8
Acromegaly and Gigantism        571             123
Ehlers-Danlos Syndrome          285             34
Narcolepsy without Cataplexy    438             76
Spina Bifida                    1,968           238
Stiff-Man Syndrome              82              17
Tourette Syndrome               667             34
Bell’s Palsy                    2,534           402
Bulimia Nervosa                 919             88
Cushing’s                       1,443           298
Peyronie’s Disease              694             157
Wilson’s Disease                140             49
Meningioma                      1,444           355
Wegener’s                       363             141
Microcephalus                   1,070           85
Not included in SD searches:
• Bone marrow transplant
• SCID
Flagged Compromised samples:
• Transfusion within 2 weeks of blood draw
• Leukemia
• Myeloma
• Lymphoma
• Pre-leukemic states
General algorithm for determining EMR phenotype
Definite Cases (algorithm-defined)
Possible Cases (require manual review)
Excluded (algorithm-defined)
Controls (algorithm-defined)
• Iteratively refine case definition through partial manual review until case definition yields PPV ≥ 95%
• For small case sizes (~100), hand curate cases but use automated case definitions for others
• For samples with inadequate counts of “Definite Cases”, manually review possible cases to determine true positives
• For controls, exclude all potentially overlapping syndromes and possible matches, iteratively refine such that NPV ≥ 98%
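The PPV and NPV targets above can be checked after each round of manual review with a few lines of code. This is a minimal sketch: the `predictive_values` helper and the review counts are hypothetical, and only the 95% and 98% thresholds come from the slide.

```python
from typing import Iterable, Tuple

PPV_TARGET = 0.95  # threshold for case definitions, from the slide
NPV_TARGET = 0.98  # threshold for control definitions, from the slide

def predictive_values(reviews: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (PPV, NPV) from pairs of (algorithm_says_case, reviewer_says_case)."""
    tp = fp = tn = fn = 0
    for predicted, actual in reviews:
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical review round: 100 algorithm-defined cases, 100 algorithm-defined controls.
reviews = (
    [(True, True)] * 96 + [(True, False)] * 4
    + [(False, False)] * 99 + [(False, True)] * 1
)
ppv, npv = predictive_values(reviews)
needs_refinement = ppv < PPV_TARGET or npv < NPV_TARGET
print(f"PPV={ppv:.2f} NPV={npv:.2f} refine={needs_refinement}")
```

If either value falls short, the case or control definition is tightened and another sample of charts is reviewed. The function assumes the review set contains at least one predicted case and one predicted control.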
The problem with ICD9 codes
• ICD9 codes give both false negatives and false positives
• False negatives:
o Outpatient billing limited to 4 diagnoses/visit
o Outpatient billing done by physicians (e.g., takes too long to find the unknown ICD9)
o Inpatient billing done by professional coders:
– omit codes that don’t pay well
– can only code problems actually explicitly mentioned in documentation
• False positives:
o Diagnoses evolve over time -- physicians may initially bill for suspected diagnoses that later are determined to be incorrect
o Billing the wrong code (perhaps it is easier to find for a busy clinician)
o Physicians may bill for a different condition if it pays for a given treatment
– Example: Anti-TNF biologics (e.g., infliximab) originally not covered for psoriatic arthritis, so rheumatologists would code the patient as having rheumatoid arthritis
EMR Phenotyping
[Diagram: Medications + Labs + ICD-9s (≥3 codes), combined with Time Constraints and Exclusions, define the PHENOTYPE.]
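The combination of medications, labs, ICD-9 codes, time constraints, and exclusions can be sketched as a simple rule. Everything here is illustrative: the `Patient` fields, the choice to require all three evidence types for a definite case, and the reading of the time constraint as "codes on distinct dates" are assumptions, not the published BioVU algorithms. Only the "≥3 codes" threshold comes from the slide.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set

@dataclass
class Patient:
    icd9_dates: List[date] = field(default_factory=list)  # dates a target ICD-9 code was billed
    has_target_medication: bool = False
    has_abnormal_lab: bool = False
    exclusion_codes: Set[str] = field(default_factory=set)  # codes for overlapping syndromes

def classify(p: Patient, min_codes: int = 3) -> str:
    """Assign one of: excluded, definite case, possible case, control."""
    if p.exclusion_codes:
        return "excluded"
    # Assumed time constraint: repeated codes must fall on distinct dates.
    distinct_days = len(set(p.icd9_dates))
    evidence = [distinct_days >= min_codes, p.has_target_medication, p.has_abnormal_lab]
    if all(evidence):
        return "definite case"
    if any(evidence) or distinct_days > 0:
        return "possible case"  # goes to manual review
    return "control"

# Hypothetical patients.
definite = Patient([date(2010, 1, 5), date(2010, 4, 2), date(2011, 2, 9)], True, True)
possible = Patient([date(2010, 1, 5)])
excluded = Patient([date(2010, 1, 5)], True, True, {"hypothetical-exclusion-code"})
control = Patient()
print([classify(p) for p in (definite, possible, excluded, control)])
```

A patient with any partial evidence is kept out of the control group, which mirrors the rule that controls exclude possible matches.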
Lessons from preliminary phenotype development
• Eliminating negated and uncertain terms:
– “I don’t think this is MS”, “uncertain if multiple sclerosis”
• Delineating section tag of the note
– “FAMILY MEDICAL HISTORY: Mother had multiple sclerosis.”
• Adding requirements for further signs of “severity of disease”
– For MS: an MRI with T2 enhancement, myelin basic protein or oligoclonal bands on lumbar puncture, etc.
– This could potentially miss patients with outside work-ups, however
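The first two lessons (dropping negated or uncertain mentions, and ignoring mentions in the family history section) can be illustrated with a small rule-based filter. This is a toy sketch, not the NLP system used for BioVU: the cue list, the section-header pattern, and the sample note are all hypothetical.

```python
import re
from typing import List

# Hypothetical, deliberately short cue list for negation and uncertainty.
NEGATION_CUES = re.compile(
    r"\b(no|not|don't|doesn't|denies|without|uncertain|unlikely|rule out)\b",
    re.IGNORECASE,
)
SECTION_HEADER = re.compile(r"^([A-Z][A-Z &/]+):\s*(.*)$")
IGNORED_SECTIONS = {"FAMILY MEDICAL HISTORY"}
TERM = re.compile(r"\b(?:(?i:multiple sclerosis)|MS)\b")

def affirmed_mentions(note: str) -> List[str]:
    """Return sentences that mention the term, are not negated, and are not family history."""
    hits = []
    section = None
    for line in note.splitlines():
        header = SECTION_HEADER.match(line)
        if header:
            section, text = header.group(1).strip(), header.group(2)
        else:
            text = line  # continuation of the current section
        if section in IGNORED_SECTIONS:
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if TERM.search(sentence) and not NEGATION_CUES.search(sentence):
                hits.append(sentence.strip())
    return hits

note = """FAMILY MEDICAL HISTORY: Mother had multiple sclerosis.
ASSESSMENT: I don't think this is MS.
Uncertain if multiple sclerosis.
PLAN: Continue interferon for multiple sclerosis."""
print(affirmed_mentions(note))
```

Sentence-level cue matching like this is crude; it treats any cue in the sentence as applying to the term, so it will over-suppress in long sentences.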
Other lessons (more difficult to correct)
• A number of incorrect ICD9 codes for RA and MS assigned to patients
• Evolving disease
– “Recently diagnosed with Susac’s syndrome - prior diagnosis of MS incorrect.” (Notes also included a thorough discussion of MS, ADEM, and Susac’s syndrome.)
• Difference between two doctors:
– Presurgical admission H&P includes “rheumatoid arthritis” in the past medical history
– Rheumatology clinic visit notes say the diagnosis is “dermatomyositis” - never mention RA
• Sometimes incorrect diagnoses are propagated through the record due to cutting-and-pasting / note reuse
ANALYSIS PLAN
1. Sample size estimation
2. Dependent/outcome variable
3. Independent variables (include SNPs, covariates, confounders)
a. Should have race, gender, age in all plans
4. Statistical method proposed
a. Type of model if appropriate
b. How SNPs will be coded
5. Power calculation
6. Population stratification plans
7. QC plans
a. Call rate, gender checks, HWE – these will be important to do on each dataset pulled to check for phenotype specific QC issues
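Two of the QC checks named in the plan, call rate and Hardy-Weinberg equilibrium (HWE), can be sketched for a single SNP as below. The sketch assumes additive coding (0, 1, or 2 copies of the minor allele, `None` for a failed call), which is one common answer to "how SNPs will be coded"; the helper names and genotype data are hypothetical.

```python
from typing import Optional, Sequence

def call_rate(genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of samples with a successful genotype call."""
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)

def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic (1 degree of freedom) for deviation from HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical additive-coded genotypes for one SNP; None marks a failed call.
genotypes = [0] * 25 + [1] * 50 + [2] * 25 + [None] * 2
counts = [genotypes.count(k) for k in (0, 1, 2)]

rate = call_rate(genotypes)
chi2 = hwe_chi_square(*counts)
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
print(f"call rate={rate:.3f} HWE chi2={chi2:.2f} deviates={chi2 > 3.841}")
```

The chi-square form breaks down for monomorphic SNPs (an expected count of zero) and is unreliable for rare alleles, where an exact test is the usual substitute.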
PHENOTYPE PLAN
1. Trait of interest for study
2. Demographic constraints (e.g. gender, age, and/or ethnicity)
3. Cases and controls require outline of definition including:
• Inclusion criteria (e.g. ICD9 codes, keyword search, medications, laboratory results)
• Exclusion criteria (e.g. ICD9s, keywords, meds, labs, minimum data or follow up)
4. Validation plan for phenotype (e.g. manual review of all or some records)
VICTR Funding
[Figure: BioVU workflow diagram, repeated over several build steps. Under a data use agreement + IRB approval, an investigator query runs against the scrubbed records, which carry hashed IDs (B699tre563msd…., F5rt783mbncds….), and returns cases + controls. Later steps add manual review, sample retrieval for the selected cases + controls, and genotyping to test genotype-phenotype relations.]
BioVU Genotyping Process
1. Investigator selects cases and controls from Synthetic Derivative
2. Investigator signals BioVU program to initiate sample selection
3. BioVU notifies DNA resources core that samples are ready for selection and picking
4. Samples are provided to appropriate lab and are genotyped
5. Investigator and BioVU program receive genotype data
6. Genotyped data analyzed by investigator
BioVU Requests
[Figure: BioVU Requests and BioVU Approvals, split into DNA Requests and Data Requests; y-axis 0 to 60.]
60 Total Requests
43 Approvals
BioVU: New Directions
• A well characterized cohort of individuals without specific diseases across all ages to be used as controls
• Expansion of BioVU to capture and store plasma to enable candidate proteomic/biomarker research
• Expanding BioVU genotyping to include mitochondrial SNP genotyping and copy number variants
• Link pediatric DNA samples to maternal samples (mom-baby pairs resource)
• Expansion of BioVU sequencing activities to include whole exome sequencing on targeted populations
FAQ “answers”
• SD access: “non-human subjects” IRB review (days)
• Current access costs: $4/sample
• Genotyping data: no charge
• Genotyping:
o Investigator-funded
– Consider VICTR as a funding source
o Genotyping/sequencing performed in VUMC Core Facilities
– Justification must be provided for outside genotyping, including quality control plans
o Genotype “redeposit” part of the data use agreement
Questions?
Contact: Erica Bowton PhD
BioVU Program Manager
erica.bowton@vanderbilt.edu
322-1975