Case Study: Merging EMR Data from VA Hospitals

Merging data from hospitals that use the same EMR system yields massive amounts of data; however, even installations of the same system will vary.

Brian Nordberg
Data Manager
University of Utah

Background

• VHA developed VistA from the ground up
• Deployed at all 130 VA sites and CBOCs
• ~27 million patients in our MPI
• ~845 million outpatient encounters (as of 6/10/2010)
• Individual sites have tailored VistA for their own unique uses
• As much as 60% of all data in VistA are in text notes
• ~100 different "packages", including:
  • Pharmacy
  • Surgery
  • Medicine
  • Radiology

National Data: The Good, Bad and Ugly

• Good: already merged at a national level, and the data are very usable
  • Outpatient encounters, ICD9, CPT
  • Vitals (Height, Weight, BP, Pulse, Temp, Pain Score, Hearing..)
  • Demographics
  • Inpatient visits, ICD9, CPT
  • Budgeting/Cost
• Bad: merged, but have issues
  • Pharmacy
  • Laboratory
  • Microbiology
• Ugly
  • Orders
  • Notes: 569,176,954 for 2 regions
  • Possibly everything else; we won't know until we pull it

Notes in VistA by Year
[Figure: Sum of Encounters and Sum of Notes by year, 1999 to 2009; y-axis 0 to 90,000,000]

Areas of Focus

• Metadata: where are the data, and what do they mean?
• Data pull "methods" and validation
• Data coding differences and data profiling

Metadata

VistA contains 78,301 distinct data fields over 9,675 different files (~tables)

The Medicine Package: 1 of ~100 packages in VistA

Metadata Sources

• Corporate Data Warehouse (fully merged data)
  • Outpatient data
  • Vitals
  • More on the way
• VA Information Resource Center (VIReC) (data in files by VISN per year, so 23 files per year)
  • Medical SAS datasets
  • Encounters
  • Demographics
• Decision Support Systems
  • Labs (~50 most frequent lab tests)
  • Utilization
  • Some pharmacy (150 most frequent PHA prescriptions)
  • Budget and Cost

VistA Files and Fields Metadata

Simple Queries of Metadata

Here we were looking for Ejection Fraction. This gives us a starting point for finding the data needed for the study. Next, we spoke to some experts at a station, who indicated they were putting the data in those fields.
So we pulled those tables.

Ejection Fraction Study

Discrete Ejection Fraction fields (698.1, 698.8) populated

VHA Station | Echo Reports
600 | 2
640 * | 10,391
653 | 132
660 | 1
691 | 910

* Station with whom we discussed data

Other Stations Put EF Data in Notes

VHA Station | Echo Reports with discrete EF fields (698.1, 698.8) populated | Text Notes referencing "Ejection Fraction"
600 | 2 | 44,948
640 * | 10,391 | 76,584
653 | 132 | 11,482
660 | 1 | 64,264
691 | 910 | 54,785

Metadata tell us where; next is how we pull the data

Several ways to pull data from MUMPS, but different methods can yield different results

• MUMPS Data Extractor: very expensive software; not many stations have it
• Custom M code: now frowned upon, as it can bring VistA to its knees
• Shadow (replicated) VistA system: the corporate data warehouse uses this method

Validating Data Pull Methods

• Microbiology accessions by site by month for 2 different methods
• Blue was supposed to be our "Gold Standard"; after review we invalidated our "Gold Standard"

[Figure: microbiology accessions by VA station (A to J) by month/year, 10/1/2008 to 7/1/2009, one panel per pull method; y-axis 0 to 3,000]
Data Pull Method 2

Compared to yet another "gold standard": much closer, but unable to get an exact match

[Figure: counts for pull method 2 against the second "gold standard"; y-axis 0 to 1,600]

Aggregating the Data and Handling Data Conversions

Profiling for Data Coding and Conversions

• Do the data conform to the defined values or range of values expected?
• Data types: Alpha, Numeric, Date
• Outliers: note, transform or remove

Vital Type | Count | Min Result | Max Result | Avg Result | StdDev Result
Height | 77,506,463 | -18 | 77,295 | 69.1 | 25.6
Weight | 150,553,303 | -5 | 2,778,808 | 200.4 | 21,827.4

• Larger data types take up more database space, but if a column is sized too small, imports will fail or data will get truncated
  • bigint: integer (whole number) data from -2^63 (-9,223,372,036,854,775,808) through 2^63 - 1 (9,223,372,036,854,775,807)
  • int: integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647)
  • smallint: integer data from -2^15 (-32,768) through 2^15 - 1 (32,767)
  • tinyint: integer data from 0 through 255

Data Coding

Outliers may skew analysis

[Figure: Average Pain Scores by VISN (VISN 1 to 23); y-axis 0 to 180. OUCH!]

Do we convert, delete, or leave the values and note them in the metadata? Regardless, we need to discuss this with the people coding at VISN 23.

Data Conversions: Data Typing

MUMPS is not a strongly typed database, so it allows invalid dates, times, numerics…

• VistA stores dates and times as a string.
To convert, add 1700 to the first 3 digits to get the year; the next 2 digits are the month…
• SQL (Oracle, SQL Server) datetime data types cannot contain missing days

AdmitDate | Converted AdmitDate
3000306.215 | 3/6/2000 21:50
3000316.101 | 3/16/2000 10:11
3000316.102 | 3/16/2000 10:18
2840913 | 9/13/1984 0:00
30003 | 3//2000

Cases illustrated: admit date complete, missing day, missing hour, missing minute

Data Conversions: Race

[Figure: patient counts by race category; y-axis 0 to 14,000,000]

711,474 patients have different race categories from different stations

Data Conversion: Lab Data

• 36,433 different laboratory test names (FY07 to FY09)
• Similar test names may contain very different results
• Example: CREATININE. Very similar test names, very different results
• Combining tests requires clinical knowledge of the tests and their possible result values

Creatinine Lab Test Names

Normal levels of creatinine in the blood are ~0.5 to 1.7 milligrams
Normal creatinine urine value is ~27 to 260 ml

[Table: one row per test name; columns Counts, Min Result, Max Result, Avg Result, Standard Dev]

Test names in the table:
ALBUMIN CREATININE RATIO
AMYLASE ISOENZYMES (PANCREATIC)
BUN/CREAT RATIO
COMPUTED CREATININE CLEARANCE
CREAT 24H CONC(DC'd 9-07)
CREATININE
CREATININE (FLUID only)
CREATININE (mg/24 Hr)
CREATININE (O)
CREATININE (PRIOR TO 8-10-04)
CREATININE (random urine)
CREATININE (Ref.Lab)
CREATININE (Serum)
CREATININE (urine) - mg/24HRS
CREATININE (urines)
CREATININE {St.}
CREATININE 24H CONC
CREATININE CLEARANCE
Creatinine Clearance Result
Creatinine Serum Result
CREATININE(CRT), URINE
CREATININE(sera,blood)
CREATININE(serum/plasma)
CREATININE(spot ua or fluids)
CREATININE(ua-random)
CREATININE,urine
CREATININE,urine(prior 3-06)
CREATININE,urine24hr, not clrnce
CT CREATININE
MERCURY/CREAT RATIO
MICROALBUMIN/CREAT. RATIO

Sample rows:

Test Name | Counts | Min Result | Max Result | Avg Result | Standard Dev
MICROALBUMIN/CREAT. RATIO | 837 | 0.0 | 6486.0 | 84.5 | 330.5
CREATININE(ua-random) | 3 | 45.3 | 120.2 | 73.5 | 40.7

Different Coding Practices

• VistA has a complex hierarchical database
• Many discrete fields to store data
• Sites may choose to store data in those fields, or choose other fields
  • Ejection Fraction
  • Blood Pressure: may be stored in discrete fields or in text

Coding of Text Documents at Each Site

Study: Review of History and Physical Notes for Coding of Hospital Acquired Infections

I sampled all records containing "History" to avoid "&" vs "AND": 112,274 records

Sta3n | Document Title (History and Physical) | Text Documents
436 | HISTORY & PHYSICAL | 30,470
442 | HISTORY & PHYSICAL (BURROWS) | 601
442 | HISTORY & PHYSICAL (FERMELIA) | 85
575 | HISTORY & PHYSICAL NOTE | 77
442 | HISTORY & PHYSICAL TEMPLATE | 1,212
660 | HISTORY & PHYSICAL* | 657
442 | HISTORY AND PHYSICAL | 12,591
554 | HISTORY AND PHYSICAL | 36,068
442 | HISTORY AND PHYSICAL CONSULT REPORT | 1,353
554 | HISTORY AND PHYSICAL EXAM SCB | 22,906
442 | HISTORY AND PHYSICAL O&E | 145
554 | HISTORY CBCB | 6,109
TOTAL | | 112,274

H&P Would Have Been Missed

A search on "History" does not match titles that abbreviate it to "H&P".

Sta3n | Document Title (History and Physical) | Text Documents
660 | H&P * CARDIOLOGY | 206
660 | H&P **SURGERY PRE-OP | 4,829
554 | H&P FOR VISUAL IMPAIRMENT SERVICES | 27
660 | H&P GEM EVALUATION | 21
660 | H&P GENERAL SURGERY ADMIT | 86
660 | H&P GERIATRICS | 17
660 | H&P MEDICAL STUDENT | 98
660 | H&P MEDICINE ADMIT | 781
660 | H&P MEDICINE INTERN ADMISSION NOTE | 22,894
660 | H&P MEDICINE RESIDENT ADMISSION | 19,729
660 | H&P MEDICINE RESIDENT ADMIT NOTE | 966
660 | H&P MEDICINE STUDENT ADMISSION | 6,129
660 | H&P MEDICINE SUB-I ADMIT NOTE | 628
660 | H&P MH 3A PSYCHIATRIC ADMIT | 4,445
660 | H&P MH HOMELESS PRIMARY CARE PROVIDER NEW PATIENT | 112
660 | H&P MH HOMELESS PROGRAM NOTE | 22
660 | H&P MH SUBSTANCE ABUSE TREATMENT | 227
660 | H&P MICU RESIDENT ADMISSION | 4,849
660 | H&P NEUROLOGY | 1,220
660 | H&P NEUROSURGERY | 11
660 | H&P PM&R INPATIENT ADMISSION | 175
660 | H&P PM&R INTERDISCIPLINARY ADMIT NOTE | 718
660 | H&P PODIATRY | 71
660 | H&P PRIMARY CARE | 2,430
660 | H&P PRIMARY CARE MID LEVEL | 898
660 | H&P PRIMARY CARE NOTE | 1,528
660 | H&P PRIMARY-CARE PROVIDER NEW-PATIENT | 29,012
660 | H&P PSYCHIATRIC ADMIT | 218
660 | H&P PSYCHIATRY ADMIT | 2,873
660 | H&P SURGERY | 11,826
660 | H&P SURGERY (INPATIENT) | 539
660 | H&P SURGERY CONSULT | 2,165
660 | H&P SURGERY UPDATE | 1,221
660 | H&P UROLOGY | 45
660 | H&P VASCULAR SURGERY ADMIT | 72
660 | H&P WOMEN'S | 29
660 | H&P** PRE-OP MULTIDISCIPLINARY NOTE | 26
TOTAL | | 121,143

What Do We Do?

• Metadata: data about the data. More time needs to be spent with data owners to document it
• Data stewards need to understand the coding practices in their institutions, and researchers need to work with data stewards
• Data validation: each data pull should be validated against some standard
• Data profiling: each dataset will need to be analyzed for what it contains, the range of values…
• Standardization: working with standards bodies at the hospital and national levels
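
The date conversion described under Data Typing (add 1700 to the first 3 digits for the year, then read 2 digits each for month and day, with the time after the decimal point) can be sketched in Python. This is a minimal sketch, not VistA's own code: the names `DateParts`, `fileman_to_parts` and `to_datetime` are hypothetical, and returning `None` for a missing month or day is one possible design choice. It also assumes the stored time drops trailing zeros, which is consistent with `3000306.215` converting to 21:50 in the table above.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class DateParts(NamedTuple):
    """Components of a VistA date string; None marks a missing component."""
    year: int
    month: Optional[int]
    day: Optional[int]
    hour: Optional[int]
    minute: Optional[int]


def fileman_to_parts(value: str) -> DateParts:
    """Split a VistA date string (YYYMMDD.HHMM) into its components.

    The value is handled as a string, not a float, so digits are never rounded.
    """
    date_part, _, time_part = value.strip().partition(".")
    if len(date_part) < 3 or not date_part.isdigit():
        raise ValueError(f"not a VistA date: {value!r}")
    year = int(date_part[:3]) + 1700
    # A month or day that is absent or stored as "00" is treated as missing.
    month = int(date_part[3:5]) if len(date_part) >= 5 else 0
    day = int(date_part[5:7]) if len(date_part) >= 7 else 0
    hour = minute = None
    if time_part:
        if not time_part.isdigit():
            raise ValueError(f"not a VistA time: {value!r}")
        padded = time_part.ljust(4, "0")  # restore dropped trailing zeros
        hour, minute = int(padded[:2]), int(padded[2:4])
    return DateParts(year, month or None, day or None, hour, minute)


def to_datetime(parts: DateParts) -> Optional[datetime]:
    """Return a datetime, or None when the date is incomplete or invalid."""
    if parts.month is None or parts.day is None:
        return None
    try:
        return datetime(parts.year, parts.month, parts.day,
                        parts.hour or 0, parts.minute or 0)
    except ValueError:  # e.g. month 13 or day 31 in a 30-day month
        return None


for raw in ["3000306.215", "2840913", "30003"]:
    print(raw, "->", to_datetime(fileman_to_parts(raw)))
# 3000306.215 -> 2000-03-06 21:50:00
# 2840913 -> 1984-09-13 00:00:00
# 30003 -> None
```

Keeping the parsed components separate from the `datetime` lets an incomplete date such as `30003` keep its year and month, and the load step can then decide whether to store NULL, a text column, or a flagged approximation.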
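
The profiling step (count, minimum, maximum, average, standard deviation, and a check against an expected range) and the choice of an integer column type can be sketched as follows. The function names `profile` and `smallest_int_type`, the sample readings, and the plausible height range of 20 to 100 inches are illustrative assumptions; only -18 and 77,295 (the Height minimum and maximum) and the integer type ranges come from the slides above.

```python
import statistics
from typing import Dict, List, Sequence, Tuple

# Integer type ranges as listed in the profiling section (SQL Server names).
INT_TYPES: List[Tuple[str, int, int]] = [
    ("tinyint", 0, 255),
    ("smallint", -2**15, 2**15 - 1),
    ("int", -2**31, 2**31 - 1),
    ("bigint", -2**63, 2**63 - 1),
]


def profile(values: Sequence[float], plausible: Tuple[float, float]) -> Dict[str, object]:
    """Summarize a numeric column and list values outside the plausible range."""
    if not values:
        raise ValueError("cannot profile an empty column")
    low, high = plausible
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": statistics.mean(values),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "outliers": [v for v in values if not (low <= v <= high)],
    }


def smallest_int_type(lowest: int, highest: int) -> str:
    """Pick the smallest integer type that holds every observed value."""
    for name, type_min, type_max in INT_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("range does not fit any integer type")


# Made-up height readings in inches, plus the two extremes from the Vitals table.
heights = [69.0, 71.5, -18.0, 66.0, 77295.0, 70.0]
report = profile(heights, plausible=(20.0, 100.0))
print(report["min"], report["max"], report["outliers"])
# -18.0 77295.0 [-18.0, 77295.0]

# Sizing a column from the observed extremes rather than the expected ones:
print(smallest_int_type(-18, 77295))  # int
```

Sizing from the observed minimum and maximum avoids failed or truncated imports, but it also means unresolved outliers drive the column type, which is one more reason to decide how to treat them first.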