Vanderbilt’s DNA Databank: BioVU

Personalized Medicine
• Integration of genomic information into clinical decision making
• Personalized disease treatment and also preventative therapies

What is BioVU?
• The move towards personalized medicine requires very large sample sets for discovery and validation
• BioVU: biobank intended to support a broad view of biology and enable personalized medicine
• Contains de-identified DNA extracted from leftover blood after clinically-indicated testing of Vanderbilt patients who have not opted out
• Linked to Synthetic Derivative: de-identified EMR
• Current sample number: 135,765
  o 120,705 adult samples
  o 15,099 pediatric samples

Patient Communication Modules
[Figure: de-identification diagram. Labels: John Doe; A7CCF99DE5732….; One way hash; A7CCF99DE65732….; John Doe eligible; Extract DNA; The “synthetic derivative” (SD): can be updated]

The Synthetic Derivative
• A derivative of the EMR: information content reduced by ‘scrubbing’ identifiers
• Systematically shifted event dates
• Contains ~1.9 million records
  o ~1 million with detailed longitudinal data
  o averaging 100,000 bytes in size
  o an average of 27 codes per record
• Records updated over time and are current through 4/30/11

Synthetic Derivative Data Types
Narratives, such as:
• Clinical Notes
• Discharge Summaries
• History and Physicals
• Problem Lists
• Surgical Reports
• Progress Notes
• Letters
Diagnostic Codes, Procedural Codes
Forms (intake, assessment)
Reports (pathology, ECGs, echocardiograms)
Clinical Communications
Lab Values and Vital Signs
Medication Orders
TraceMaster (ECGs)

[Figure: Synthetic Derivative vs. BioVU. Synthetic Derivative: ~1.9 million records, keyed by hashed IDs (A7CDE6532 ….)]
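The two de-identification steps named above, a one-way hash of the patient identifier and systematically shifted event dates, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the slides: the choice of SHA-512, the salt, the 1 to 364 day shift range, and the function names `research_id`, `shift_days` and `deidentify`.

```python
import hashlib
from datetime import date, timedelta

# Hypothetical salt for illustration only; a real system would keep it secret.
SALT = b"example-salt-not-a-real-secret"


def research_id(mrn: str) -> str:
    """One-way hash of a patient identifier (SHA-512 is an assumption)."""
    return hashlib.sha512(SALT + mrn.encode("utf-8")).hexdigest().upper()


def shift_days(rid: str, max_days: int = 364) -> int:
    """Per-patient offset in days, derived from the hash so it is stable."""
    return int(rid[:8], 16) % max_days + 1


def deidentify(mrn, events):
    """Replace the identifier with its hash and shift every event date back.

    events is a list of (date, text) pairs. All dates for one patient move by
    the same offset, so intervals between events are preserved.
    """
    rid = research_id(mrn)
    delta = timedelta(days=shift_days(rid))
    return rid, [(day - delta, text) for day, text in events]


# Usage with made-up data.
rid, shifted = deidentify(
    "MRN-0001",
    [(date(2010, 1, 5), "admission"), (date(2010, 1, 9), "discharge")],
)
print(rid[:12], shifted)
```

Deriving the offset from the hash is one way to let later record updates reuse the same shift for the same patient without storing a lookup table; whether BioVU does this is not stated in the slides.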
BioVU: ~135,000 samples

Sample accrual
Current accrual as of 2-13-2012: 135,765 samples (15,099 pediatric)
[Figure: sample accrual over time, y-axis 0 to 225,000. Series: Pediatric samples accrued; Adult samples accrued; Anticipated pediatric samples; Anticipated adult sample accrual]

BioVU Demographics
[Figure: demographic breakdown. AGE: <1, 1 - 10, 11 - 20, 21 - 30, 31 - 40, 41 - 50, 51 - 60, 61 - 70, 71 - 75, >75. GENDER: Male, Female. RACE: African American, Asian, Hispanic, Others, White]

BioVU Sample Management
[Figure: RTS SmaRTStore]

Validation in BioVU
• Sample handling algorithms
  o Gender match
  o 1/384 gender mismatches
• Ancestry
  o Characterize sample ancestry, assess usefulness of ‘race’ as defined in EMR
  o Provide a panel of ancestry informative markers that define ancestry
  o No significant difference between the concordance of self-report or observer-report with genetic ancestry
• Demonstration project (American Journal of Human Genetics, 2010)
  o Can known associations between genetic variants and common diseases be identified in the EMR?

The “demonstration project”
• Genotype “high-value” SNPs in the first 8,000 samples accrued.
  o including SNPs associated by replicated genome-wide experiments with common diseases & traits
    1. Atrial fibrillation
    2. Crohn’s disease
    3. Multiple Sclerosis
    4. Rheumatoid arthritis
    5. Type II Diabetes
• Develop Natural Language Processing methods to identify cases and controls
• Are genotype-phenotype relations replicated?

First results
[Figure: forest plot of odds ratios per marker, grouped by disease; x-axis: Odds Ratio, log scale from 0.5 to 5.0]

disease                 marker        gene / region
Atrial fibrillation     rs2200733     Chr. 4q25
                        rs10033464    Chr. 4q25
Crohn's disease         rs11805303    IL23R
                        rs17234657    Chr. 5
                        rs1000113     Chr. 5
                        rs17221417    NOD2
                        rs2542151     PTPN22
Multiple sclerosis      rs3135388     DRB1*1501
                        rs2104286     IL2RA
                        rs6897932     IL7RA
Rheumatoid arthritis    rs6457617     Chr. 6
                        rs6679677     RSBN1
                        rs2476601     PTPN22
Type 2 diabetes         rs4506565     TCF7L2
                        rs12255372    TCF7L2
                        rs12243326    TCF7L2
                        rs10811661    CDKN2B
                        rs8050136     FTO
                        rs5219        KCNJ11
                        rs5215        KCNJ11
                        rs4402960     IGF2BP2

Types of projects
• Discovery or validation of genotype-phenotype relations for disease susceptibility or drug responses
• Discovery of new disease/susceptibility genes by resequencing in patients (obesity, Cushing's, susceptibility to infection, insomnia, pre-term birth)
• Access samples without disease X, or “normals” of specified ancestry, or old normals
• Phenome-wide association study (PheWAS): in development

Data Use Agreement

Genotyping Data Accrual
[Figure: Total Genotyped Subjects, N=56,859, y-axis 0 to 60,000. Series: Adult Samples; Ped Samples]
[Figure: Total GWAS Subjects, N=14,747, y-axis 0 to 16,000]

Common Diagnoses in BioVU

Examples of ICD-9 codes for rare diseases
Columns: Example Rare Disease; Number in SD; Number in BioVU
1,070 85 Pica 115 22 Septicemic Plague 21 0 Pick’s Disease 45 8 Acromegaly and Gigantism 571 123 Ehlers-Danlos Syndrome 285 34 Narcolepsy without Cataplexy 438 76 Spina Bifida 1968 238 Stiff-Man Syndrome 82 17 Tourette Syndrome 667 34 Bell’s Palsy 2534 402 Bulimia Nervosa 919 88 Cushing’s 1443 298 Peyronie’s Disease 694 157 Wilson’s Disease 140 49 Meningioma 1444 355 Wegener’s 363 141 Microcephalus

Not included in SD searches:
• Bone marrow transplant
• SCID

Flagged Compromised samples:
• Transfusion within
2 weeks of blood draw
• Leukemia
• Myeloma
• Lymphoma
• Pre-leukemic states

General algorithm for determining EMR phenotype
[Figure: four groups: Definite Cases (algorithm-defined); Possible Cases (require manual review); Excluded (algorithm-defined); Controls (algorithm-defined)]
• Iteratively refine case definition through partial manual review until case definition yields PPV ≥ 95%
• For small case sizes (~100), hand curate cases but use automated case definitions for others
• For samples with inadequate counts of “Definite Cases”, manually review possible cases to determine true positives
• For controls, exclude all potentially overlapping syndromes and possible matches, iteratively refine such that NPV ≥ 98%

The problem with ICD9 codes
• ICD9 codes give both false negatives and false positives
• False negatives:
  o Outpatient billing limited to 4 diagnoses/visit
  o Outpatient billing done by physicians (e.g., takes too long to find the unknown ICD9)
  o Inpatient billing done by professional coders:
    – omit codes that don’t pay well
    – can only code problems actually explicitly mentioned in documentation
• False positives:
  o Diagnoses evolve over time: physicians may initially bill for suspected diagnoses that later are determined to be incorrect
  o Billing the wrong code (perhaps it is easier to find for a busy clinician)
  o Physicians may bill for a different condition if it pays for a given treatment
    – Example: Anti-TNF biologics (e.g., infliximab) originally not covered for psoriatic arthritis, so rheumatologists would code the patient as having rheumatoid arthritis

EMR Phenotyping
[Figure: Medications + Labs + ICD-9s (≥3 codes); Time Constraints; Exclusions; PHENOTYPE]

Lessons from preliminary phenotype development
• Eliminating negated and uncertain terms:
  – “I don’t think this is MS”, “uncertain if multiple sclerosis”
• Delineating section tag of the note
  – “FAMILY MEDICAL HISTORY: Mother had multiple sclerosis.”
• Adding requirements for further signs of “severity of disease”
  – For MS: an MRI with T2
enhancement, myelin basic protein or oligoclonal bands on lumbar puncture, etc.
  – This could potentially miss patients with outside work-ups, however

Other lessons (more difficult to correct)
• A number of incorrect ICD9 codes for RA and MS assigned to patients
• Evolving disease
  – “Recently diagnosed with Susac’s syndrome - prior diagnosis of MS incorrect.” (Notes also included a thorough discussion of MS, ADEM, and Susac’s syndrome.)
• Difference between two doctors:
  – Presurgical admission H&P includes “rheumatoid arthritis” in the past medical history
  – Rheumatology clinic visit notes say the diagnosis is “dermatomyositis” and never mention RA
• Sometimes incorrect diagnoses are propagated through the record due to cutting-and-pasting / note reuse

ANALYSIS PLAN
1. Sample size estimation
2. Dependent/outcome variable
3. Independent variables (include SNPs, covariates, confounders)
   a. Should have race, gender, age in all plans
4. Statistical method proposed
   a. Type of model if appropriate
   b. How SNPs will be coded
5. Power calculation
6. Population stratification plans
7. QC plans
   a. Call rate, gender checks, HWE: these will be important to do on each dataset pulled, to check for phenotype-specific QC issues

PHENOTYPE PLAN
1. Trait of interest for study
2. Demographic constraints (e.g. gender, age, and/or ethnicity)
3. Cases and controls require outline of definition including:
   • Inclusion criteria (e.g. ICD9 codes, keyword search, medications, laboratory results)
   • Exclusion criteria (e.g. ICD9s, keywords, meds, labs, minimum data or follow up)
4. Validation plan for phenotype (e.g.
manual review of all or some records)

VICTR Funding

[Figure: request workflow, first builds. Labels: Investigator query; Data use agreement + IRB Approval; cases + controls; scrubbed records; hashed IDs (B699tre563msd.., F5rt783mbncds…); Manual Review]
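A validation plan based on manual review can be tied to the PPV ≥ 95% and NPV ≥ 98% targets given for the general phenotype algorithm. The sketch below assumes a reviewed subset in which each record carries the algorithm's label and the reviewer's label; the `Reviewed` type, the label strings and the function names are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Reviewed:
    algorithm_label: str  # "case" or "control", assigned by the phenotype algorithm
    reviewer_label: str   # "case" or "control", assigned on manual chart review


def _agreement(records, label):
    """Share of records the algorithm labelled `label` that the reviewer confirmed."""
    flagged = [r for r in records if r.algorithm_label == label]
    if not flagged:
        raise ValueError(f"no reviewed records labelled {label!r}")
    return sum(r.reviewer_label == label for r in flagged) / len(flagged)


def ppv(records):
    """Positive predictive value: confirmed cases / algorithm-defined cases."""
    return _agreement(records, "case")


def npv(records):
    """Negative predictive value: confirmed controls / algorithm-defined controls."""
    return _agreement(records, "control")


def meets_targets(records, ppv_target=0.95, npv_target=0.98):
    """True when the reviewed subset reaches both targets from the slides."""
    return ppv(records) >= ppv_target and npv(records) >= npv_target


# Usage with made-up review results: 20 algorithm cases (19 confirmed)
# and 50 algorithm controls (49 confirmed).
sample = (
    [Reviewed("case", "case")] * 19
    + [Reviewed("case", "control")]
    + [Reviewed("control", "control")] * 49
    + [Reviewed("control", "case")]
)
print(ppv(sample), npv(sample), meets_targets(sample))
```

If either target is missed, the slides call for refining the case or control definition and reviewing again; the sketch only covers the measurement step.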
[Figure: request workflow, next build. Labels: Investigator query; cases + controls; Sample retrieval; Data use agreement + IRB Approval; scrubbed records; hashed IDs (B699tre563msd…., F5rt783mbncds….)]
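Two of the QC checks listed in the ANALYSIS PLAN, call rate and HWE, can be computed per SNP as in the minimal sketch below. It assumes genotype calls arrive as a list with `None` for a failed call, and it uses a 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium; the function names are hypothetical, pass/fail thresholds are left to the study, and an exact test may be preferable when counts are small.

```python
import math


def call_rate(calls):
    """Fraction of non-missing genotype calls; None marks a failed call."""
    if not calls:
        raise ValueError("no calls supplied")
    return sum(c is not None for c in calls) / len(calls)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium from genotype counts.

    Returns (statistic, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Usage with made-up data.
print(call_rate(["AA", "AB", None, "BB"]))
print(hwe_chi_square(25, 50, 25))   # counts exactly at HWE proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all
```

Running these on each dataset pulled, as the plan recommends, would show whether a particular case/control selection has its own QC problems.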
[Figure: request workflow, final build. Labels: Investigator query; hashed IDs (B699tre563msd…., F5rt783mbncds….); Sample retrieval; cases + controls; Genotyping, genotype-phenotype relations]

BioVU Genotyping Process
1. Investigator selects cases and controls from Synthetic Derivative
2. Investigator signals BioVU program to initiate sample selection
3. BioVU notifies DNA resources core that samples are ready for selection and picking
4. Samples are provided to appropriate lab and are genotyped
5. Investigator and BioVU program receive genotype data
6. Genotyped data analyzed by investigator

BioVU Requests
[Figure: BioVU Requests (DNA Requests, Data Requests; y-axis 0 to 60) and BioVU Approvals: 60 Total Requests, 43 Approvals]

BioVU: New Directions
• A well characterized cohort of individuals without specific diseases across all ages to be used as controls
• Expansion of BioVU to capture and store plasma to enable candidate proteomic/biomarker research
• Expanding BioVU genotyping to include mitochondrial SNP genotyping and copy number variants
• Link pediatric DNA samples to maternal samples (mom-baby pairs resource)
• Expansion of BioVU sequencing activities to include whole exome sequencing on targeted populations

FAQ “answers”
• SD access: “non-human subjects” IRB review (days)
• Current access costs: $4/sample
• Genotyping data: no charge
• Genotyping:
  o Investigator-funded
  o Genotyping/sequencing performed in VUMC Core Facilities
  o Consider VICTR as a funding source
• Justification must be provided for outside genotyping, including quality control plans
• Genotype “redeposit” part of the data use agreement

Questions?
Contact:
Erica Bowton PhD
BioVU Program Manager
erica.bowton@vanderbilt.edu
322-1975