Overview of the Synthetic Derivative
April 16, 2010
Melissa Basford, MBA
Program Manager – Synthetic Derivative

Synthetic Derivative resource overview
• Rich, multi-source database of de-identified clinical and demographic data
• Contains ~1.8 million records
• ~1 million records with detailed longitudinal data
  • averaging 100k bytes in size
  • an average of 27 codes per record
• Records are updated over time and are current through 7/31/09

SD Establishment
[Diagram; component labels: Star Server, HEO, EDW, Data Parsing, DE-IDENTIFICATION, SD Database]
• Information collected during clinical care
• One-way hash
• Restructuring for research
• Access through secured online application
• Data export

Data Types (so far)
• Narratives, such as:
  • Clinical Notes
  • Discharge Summaries
  • History & Physicals
  • Problem Lists
  • Surgical Reports
  • Progress Notes
  • Letters & Clinical Communications
• Diagnostic codes, procedural codes
• Forms (intake, assessment)
• Reports (pathology, ECGs, echocardiograms)
• Lab values and vital signs
• Medication orders
• TraceMaster (ECGs)
• ~100 SNPs for 7000+ samples

Research use cases assumed in resource development (either alone, or with DNA samples)
• Retrospective chart reviews
• Hypothesis generation
• Rapid preliminary data for grant submissions
• Feasibility assessment

Technology + policy
• De-identification
  • Derivation of a 128-character identifier (RUI) from the MRN, generated by the Secure Hash Algorithm (SHA-512)
  • RUI is unique to its input and cannot be used to regenerate the MRN
  • RUI links data through time and across data sources
  • HIPAA identifiers removed using a combination of custom techniques and established de-identification software
• Restricted access & continuous oversight
  • Access restricted to VU; not a public resource
  • IRB approval for study (non-human)
  • Data Use Agreement
  • Audit logs of all searches and data exports

Date shift feature
• Our algorithm shifts the dates within a record by a time period that is consistent within each record but differs across records
• Shifts are up to 364 days backwards
• Example: if the date in a particular record is April 1, 2005 and the randomly generated shift is 45 days in the past, then the date in the SD is February 15, 2005

What the SD can’t do
• Outbreaks and other date-specific studies (catastrophes, etc.)
• Find a specific patient (e.g., to contact)
• Replace large-scale epidemiology research (e.g., TennCare database)
• Temporal search capabilities are limited (but under development)
  • “First this, then that” study designs require significant manual effort
  • Expect “timeline” views and searching Q1-Q2

Demographic Characteristics

                        SD          Davidson County   Tennessee   United States
N                       1,716,085   578,698           6,038,803   299,398,484
Gender (%)
  Female                55.2        51.3              51.1        50.7
  Male                  44.6        48.7              48.9        48.3
  Unknown               0.2         -                 -           -
Race/Ethnicity* (%)
  African American      14.3        27.9              16.9        12.8
  Asian / Pacific       1.2         3.0               1.4         4.6
  Caucasian             80.5        60.1              77.5        66.4
  Hispanic              2.6         7.1               3.2         14.8
  Indian American       0.1         0.4               0.3         1.0
  Others                1.4         -                 -           -
  Multiple Races        0           1.5               1.0         1.6

*A significant number of SD records are of unknown race/ethnicity. Multiple efforts are underway to better classify these records, including NLP on narratives.

Examples of frequent diagnoses in total SD
[Bar chart: record counts by diagnosis; vertical axis 0 to 70,000]
Top diagnosis codes overall:
1. FEVER
2. CHEST PAIN
3. ABDOMINAL PAIN
4. COUGH
5. PAIN IN LIMB
6. HYPERTENSION
7. ROUTINE MEDICAL EXAM
8. ACUTE URI
9. MALAISE & FATIGUE
10. HEADACHE
11. URINARY TRACT INFECTION

Examples of frequent diagnoses among peds in SD
[Bar chart: record counts by diagnosis; vertical axis 0 to 9,000]
Top diagnosis codes overall:
• ROUTIN CHILD HEALTH EXAM
• FEVER
• COUGH
• ACUTE PHARYNGITIS
• URIN TRACT INFECTION NOS
• VOMITING ALONE
• CARDIAC MURMURS NEC
• ABDOMINAL PAIN-SITE NOS
• OTITIS MEDIA NOS
• ACUTE URI NOS
• PAIN IN LIMB

Examples of ICD-9 codes for rare diseases

Example Rare Disease            Frequency   Number in SD   Number in BioVU
Microcephalus                   0.00007     566            6
Pica                            0.00004     59             9
Septicemic Plague               0.00004     20             0
Pick’s Disease                  0.00004     72             7
Acromegaly and Gigantism        0.00041     464            57
Ehlers-Danlos Syndrome          0.00011     154            9
Narcolepsy without Cataplexy    0.00004     166            17
Spina Bifida                    0.00022     1327           77
Stiff-Man Syndrome              0.00007     42             5
Tourette Syndrome               0.00007     366            9
Bell’s Palsy                    0.00078     1509           141
Bulimia Nervosa                 0.00021     640            35
Cushing’s                       0.00116     1065           129
Peyronies Disease               0.00018     369            57

Statistical considerations and limitations
Working with biostats (Schildcrout) on these issues. Some considerations:
• Selection bias for inclusion in population; representativeness of cohort and generalizability
• Bias in ICD-9 coding
• Confounding by indication
• Severity of disease
• Medication prescribed/ordered vs. received
• Timing
  • For example, an adverse event (AE) must come after the medication (timecourse)
  • Timescale upon which events could be attributed to events
• Dropout (death vs. discharge vs. transfer)
• Intervention based on in-hospital disease history

Using the SD resource

SD Access Protocol
1. Researcher requests IRB exemption
2. Researcher enters StarBRITE to complete the electronic application (IRB status is in StarBRITE)
3. Researcher signs DUA
4. SD staff verify; access granted
5. Researcher accesses SD

Data Use Agreement Components

Phenotype Searching
• Definition of phenotype for cases and controls is critical
• A basic understanding of the data elements, and of the uses and limitations of particular data points, is important
  • May require consultation with experts
  • List of ‘watch outs’ under development
• Reviewing records manually to make case determinations (or even to calculate the PPV of a search methodology) will be somewhat time consuming

The problem with ICD9 codes
ICD9 codes give both false negatives and false positives
• False negatives:
  • Outpatient billing is limited to 4 diagnoses/visit
  • Outpatient billing is done by physicians (e.g., it takes too long to find the unknown ICD9)
  • Inpatient billing is done by professional coders, who:
    • omit codes that don’t pay well
    • can only code problems explicitly mentioned in documentation
• False positives:
  • Diagnoses evolve over time: physicians may initially bill for suspected diagnoses that are later determined to be incorrect
  • Billing the wrong code (perhaps it is easier to find for a busy clinician)
  • Physicians may bill for a different condition if it pays for a given treatment
    • Example: anti-TNF biologics (e.g., infliximab) were originally not covered for psoriatic arthritis, so rheumatologists would code the patient as having rheumatoid arthritis

Lessons from preliminary phenotype development (can be corrected)
• Eliminating negated and uncertain terms:
  • “I don’t think this is MS”, “uncertain if multiple sclerosis”
• Delineating the section tag of the note:
  • “FAMILY MEDICAL HISTORY: Mother had multiple sclerosis.”
• Adding requirements for further signs of “severity of disease”
  • For MS: an MRI with T2 enhancement, myelin basic protein or oligoclonal bands on lumbar puncture, etc.
  • This could potentially miss patients with outside workups, however

Other lessons (more difficult to correct via algorithms)
• A number of incorrect ICD9 codes for RA and MS were assigned to patients
• Evolving disease
  • “Recently diagnosed with Susac’s syndrome - prior diagnosis of MS incorrect.” (Notes also included a thorough discussion of MS, ADEM, and Susac’s syndrome.)
• Difference between two doctors:
  • Presurgical admission H&P includes “rheumatoid arthritis” in the past medical history
  • Rheumatology clinic visit notes say the diagnosis is “dermatomyositis” and never mention RA
• Sometimes incorrect diagnoses are propagated through the record due to cutting-and-pasting / note reuse

Resources
• StarPanel
  • Identified clinical data; designed for clinical use
• Record Counter
  • De-identified clinical data; sophisticated phenotype searching
  • Returns a number: record counts and aggregate demographics
• Synthetic Derivative
  • De-identified clinical data; sophisticated phenotype searching
  • Returns record counts AND de-identified narratives, test values, medications, etc., for review and creation of study data sets
• BioVU
  • SNP data
  • De-identified clinical data; sophisticated phenotype searching
  • Able to link phenotype information to biological sample

Live Demo
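
Illustration: deriving an RUI-style identifier
The "Technology + policy" slide says the RUI is a 128-character identifier derived from the MRN with SHA-512. The sketch below shows only why a SHA-512 hex digest has that length and why the same MRN always maps to the same identifier. The function name derive_rui and the sample MRNs are made up for illustration, and the slides do not say how the MRN is encoded or whether any additional keying is applied, so the UTF-8, unkeyed choice here is an assumption rather than the SD's actual procedure.

```python
import hashlib


def derive_rui(mrn: str) -> str:
    """Return a 128-character hex identifier for an MRN (illustrative only)."""
    # Assumption: the MRN is hashed as UTF-8 text with no extra keying.
    return hashlib.sha512(mrn.encode("utf-8")).hexdigest()


rui_a = derive_rui("000123456")  # made-up MRN
rui_b = derive_rui("000123457")  # made-up MRN

print(len(rui_a))                        # 128 (512 bits as hex characters)
print(rui_a == derive_rui("000123456"))  # True: same input, same identifier
print(rui_a == rui_b)                    # False: different MRNs differ
```

Because the mapping is deterministic, records for the same patient from different sources and dates receive the same identifier, which is what lets the RUI link data through time.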
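
Illustration: per-record date shifting
The "Date shift feature" slide describes a backward shift of up to 364 days that is the same for every date within a record. A minimal sketch of that idea follows. The function names are hypothetical, and the assumption that shifts are whole days drawn uniformly from 1 to 364 is ours; the slides only say the shift is randomly generated.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 364  # upper bound stated on the slide


def pick_shift_days(rng: random.Random) -> int:
    """Draw one shift per record."""
    # Assumption: whole days, uniform over 1..364.
    return rng.randint(1, MAX_SHIFT_DAYS)


def shift_record_dates(dates, shift_days):
    """Move every date in one record back by the same number of days."""
    return [d - timedelta(days=shift_days) for d in dates]


# The slide's own example: April 1, 2005 shifted 45 days into the past.
record = [date(2005, 4, 1), date(2005, 4, 15)]
shifted = shift_record_dates(record, 45)

print(shifted[0])                      # 2005-02-15
print((shifted[1] - shifted[0]).days)  # 14, same gap as in the original record
```

Applying one shift to the whole record keeps intervals between events intact, while different shifts across records are what make date-specific studies such as outbreaks impractical.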
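
Illustration: PPV of a search methodology from manual review
The "Phenotype Searching" slide mentions reviewing records manually to calculate the PPV of a search. As a small worked example, PPV is the fraction of records returned by the search that the reviewer confirms as true cases. The review results below are invented for illustration and do not come from the SD.

```python
def positive_predictive_value(review_results):
    """PPV = confirmed cases / all records the search flagged."""
    if not review_results:
        raise ValueError("no reviewed records")
    true_positives = sum(1 for confirmed in review_results if confirmed)
    return true_positives / len(review_results)


# Hypothetical manual review of 10 records returned by a phenotype search:
# True means the reviewer confirmed the case, False means a false positive.
reviewed = [True, True, False, True, True, False, True, True, True, True]

print(positive_predictive_value(reviewed))  # 0.8
```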
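
Illustration: filtering negated, uncertain, and family-history mentions
The "Lessons from preliminary phenotype development" slide lists two correctable problems: negated or uncertain terms, and mentions that sit under the wrong section tag. The sketch below is a deliberately crude filter built from the slide's own examples. It is not the method used for the SD: the cue list, the section name "ASSESSMENT", and the function name are assumptions made for illustration, and a real system would need far broader negation handling.

```python
import re

# Cues taken from the slide's examples, plus one spelling variant.
UNCERTAIN_CUES = re.compile(
    r"\b(don[’']t think|do not think|uncertain if)\b", re.IGNORECASE
)
# Sections whose mentions should not count toward the patient's own diagnosis.
EXCLUDED_SECTIONS = {"FAMILY MEDICAL HISTORY"}
MS_PATTERN = r"\b(MS|multiple sclerosis)\b"


def counts_as_mention(sentence: str, section: str, pattern: str) -> bool:
    """True if the sentence asserts the condition for the patient."""
    if section.upper() in EXCLUDED_SECTIONS:
        return False  # e.g. a relative's diagnosis
    if not re.search(pattern, sentence, re.IGNORECASE):
        return False  # condition not mentioned at all
    return UNCERTAIN_CUES.search(sentence) is None


examples = [
    ("I don’t think this is MS", "ASSESSMENT"),
    ("uncertain if multiple sclerosis", "ASSESSMENT"),
    ("Mother had multiple sclerosis.", "FAMILY MEDICAL HISTORY"),
    ("Patient has multiple sclerosis.", "ASSESSMENT"),
]
for sentence, section in examples:
    print(counts_as_mention(sentence, section, MS_PATTERN), "-", sentence)
```

Only the last example is kept as a positive mention; the first three are rejected for the reasons the slide gives.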