Using Electronic Medical Records for Research: Practical Issues and Implementation Hurdles
Prakash M. Nadkarni MD

Benefits of EMRs
- Most of the data that you want is often in the EMR
  - Sample size analyses
  - Cohort identification/recruitment
  - Detail data
- You can implement many research-related workflows
  - Appointment scheduling enables interventions at the patient's convenience.

EMRs don't do everything
- Even Epic warns you about the need to interoperate with software designed specifically for clinical research (CRIS = Clinical Research Information System).
- Even CRISs are sub-specialized: project management/finance, grant management workflows, federal paperwork (FDA Investigational New Drug applications), general or specialized data capture (e.g., patient diaries, adaptive questionnaires).

Challenge: No Study Calendar
- Not all patients are enrolled at the same time.
- Specific evaluations or interventions are done at specific time points ("events") relative to the start of participation in the study (or some arbitrary point, e.g., working backwards from a scheduled MRI scan).
- Each time point may have a permissible range or window (e.g., a "6-month follow-up" may occur between 5 and 7 months).
- Given a protocol/study calendar, a CRIS will *generate* a provisional patient calendar.

Study Calendar (2)
- The protocol is worked out based on the information yield of each evaluation, the expected rate of change in the parameters evaluated, evaluation cost, and patient risk.
- An Event-CRF (case report form) cross-table enforces consistency.
- CRISs use "Unscheduled" events to deal with emergency conditions.
- An entire set of reports is calendar-driven, e.g., scheduled events, missing forms, out-of-range visits.
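
As a rough illustration of how a provisional patient calendar might be generated from a study calendar, and how an out-of-range visit might be flagged, here is a minimal Python sketch. The class and function names, the example protocol, and the approximation of a month as 30 days are all assumptions made for illustration; they are not taken from any particular CRIS.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ProtocolEvent:
    """One time point in the study calendar, relative to the anchor date."""
    name: str
    offset_days: int   # target day, counted from the anchor date
    window_days: int   # permissible deviation on either side of the target


@dataclass(frozen=True)
class ScheduledVisit:
    """A provisional visit for one patient, with its permissible window."""
    name: str
    target: date
    earliest: date
    latest: date


def patient_calendar(anchor: date, protocol: list[ProtocolEvent]) -> list[ScheduledVisit]:
    """Expand a protocol into provisional dates for one patient."""
    visits = []
    for ev in protocol:
        target = anchor + timedelta(days=ev.offset_days)
        window = timedelta(days=ev.window_days)
        visits.append(ScheduledVisit(ev.name, target, target - window, target + window))
    return visits


def is_out_of_range(visit: ScheduledVisit, actual: date) -> bool:
    """True if the actual visit date falls outside the permissible window."""
    return not (visit.earliest <= actual <= visit.latest)


# Hypothetical protocol: months are approximated as 30 days (an assumption).
PROTOCOL = [
    ProtocolEvent("Baseline", 0, 0),
    ProtocolEvent("6-month follow-up", 180, 30),
]

# The anchor is this patient's own start of participation, so each patient
# gets a different calendar from the same protocol.
calendar = patient_calendar(date(2024, 1, 15), PROTOCOL)
for v in calendar:
    print(f"{v.name}: target {v.target}, window {v.earliest} to {v.latest}")
```

Anchoring every event to the patient's own start date is what lets one protocol serve patients enrolled at different times; the same structure could be anchored to another date (such as a scheduled scan) by allowing negative offsets.
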
- In Epic, the closest thing to calendar functionality is the chemotherapy module (Beacon).

Non-adherence to Standards
- If a vendor ignores national/international controlled terminology standards, data pooling in cross-institutional collaborations is difficult.
- For procedures, Epic does not use Current Procedural Terminology (CPT). Instead, procedures are identified by idiosyncratic abbreviations created by hurried users; these are hard to interpret except by those users, and they vary across institutions.

Standards Challenges (2)
- Of the 15,000 laboratory tests in our instance of Epic, only about 8% are currently mapped to the Logical Observation Identifiers Names and Codes (LOINC) vocabulary.
- Sometimes the same procedure or lab test is defined more than once in a master table. The definitions are unhelpful, and one must look at the actual data to determine which are used, e.g., a histogram showing the number of tests performed over a period of time, and the maximum and minimum values.

Redundancy and heterogeneity
- The data may have been stored more than once, and in different ways, in different parts of the medical record.
  - BMI is recorded in two different places.
- "Uncontrolled" local terminologies
  - Flowsheets where blood pressure is recorded redundantly as text "124/82". (Not in UIHC, fortunately.)
  - Procedure and lab definition lists are also semi-controlled.

Duplicate Elements
- Pseudo-redundancy: subtly different data elements that are given the same label in the user interface.
- Baby's birth weight is recorded both at the time of delivery and at the time of admission to a NICU. The two are not semantically the same: with interventions, the former may be significantly more (or less) than the latter.

"Wrong" structure
- Much data (discharge summaries, etc.) is stored as text, requiring human abstraction or natural language processing (NLP).
- NLP is not 100% accurate, requiring sensitivity and specificity to be traded off.
- It is especially hard with progress notes, which are replete with abbreviations and may have little grammatical structure.
- Much of the published NLP work relies on idiosyncrasies of a particular dataset (e.g., the use of Epic templates) to achieve higher accuracy, and is not always generalizable.

The Needle in the Haystack
- The Epic schema contains several thousand tables, many unused or with empty fields.
- Documentation is incomplete or out of date.
- The first time, one may spend more time locating a particular data element than actually pulling it out.
- Persons doing data extraction need to add value by providing signposts and tips to help others who have to do the same task later.
- Even with a data warehouse, this problem will recur as long as data definitions are suboptimal.

Real-time cohort identification must be done judiciously
- "Best Practice Alerts" can be a resource drain on the responsiveness of systems.
- Do you really need real-time subject identification? Would a 24-hour delay be acceptable?
- ICU-related clinical studies; transfusion in preemies.

Transforming the Data
- The form in which data is recorded in the EMR is not necessarily the form in which it is most conveniently analyzed or reported.
- Registries often require creating derived variables:
  - Converting numerical data into categories, e.g., binning children by birth weight.
  - Converting numeric values or the existence/absence of data into Yes/No: Is the bilirubin > 5 mg/dL? Did the neonate receive nitric oxide inhalation for pulmonary hypertension?

Interfacing with statistical software
- Before: sample size, randomization
- After: analysis, fitting to models
- Some CRISs (e.g., REDCap, TrialDB) will output SAS/SPSS-formatted data files, with definitions for all variables (including enumerations for all categorical variables; SAS has a procedure called PROC FORMAT for categorical data).
- EMRs still lag.
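
The derived variables described under "Transforming the Data" could be computed along the following lines. This is a minimal Python sketch, not a registry's actual rules: the function names, the record layout, and the field names are hypothetical, and the birth-weight cut points use the commonly cited low-birth-weight thresholds (1000 g, 1500 g, 2500 g) as an assumption, since a given registry may define its own bins.

```python
from typing import Optional

# Upper bounds in grams with their labels. Assumed cut points; a registry
# may specify different bins.
BIRTH_WEIGHT_BINS = [
    (1000, "extremely low (<1000 g)"),
    (1500, "very low (1000-1499 g)"),
    (2500, "low (1500-2499 g)"),
]


def birth_weight_category(grams: Optional[float]) -> Optional[str]:
    """Convert a numeric birth weight into a category; None if not recorded."""
    if grams is None:
        return None
    for upper, label in BIRTH_WEIGHT_BINS:
        if grams < upper:
            return label
    return "2500 g or more"


def exceeds_threshold(value: Optional[float], threshold: float) -> Optional[str]:
    """Yes/No for 'is the value above the threshold?'; None when not measured."""
    if value is None:
        return None
    return "Yes" if value > threshold else "No"


def ever_received(orders: list[str], treatment: str) -> str:
    """Yes/No derived from the existence or absence of a matching record."""
    wanted = treatment.lower()
    return "Yes" if any(o.lower() == wanted for o in orders) else "No"


# Hypothetical extract for one neonate; field names are illustrative only.
record = {
    "birth_weight_g": 1320,
    "max_bilirubin_mg_dl": 6.2,
    "orders": ["Phototherapy", "Inhaled nitric oxide"],
}

derived = {
    "bw_category": birth_weight_category(record["birth_weight_g"]),
    "bilirubin_gt_5": exceeds_threshold(record["max_bilirubin_mg_dl"], 5.0),
    "received_ino": ever_received(record["orders"], "inhaled nitric oxide"),
}
print(derived)
```

Note the design choice: a missing measurement stays missing rather than becoming "No", whereas the existence/absence variable maps absence to "No" and therefore cannot distinguish "not given" from "not recorded".
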

Data Warehouse
- A database that is optimized for fast query, preferably by end-users, without interactive updates.
- Solves some problems, but not others.
  - More homogeneous structure, i.e., a handful of tables rather than thousands.
  - However, the problem of locating variables of interest doesn't go away.
  - With indifferent documentation of the variables, the problem of hunting for variables of interest is transferred from the concierge/analyst to the end-user, which may worsen the problem.

Special Challenges in EMR Data Interpretation/Reliability
- Data entry errors in source data, often a consequence of "copy and paste".
- Coding of categorical variables does not accommodate nuances in the medical history or diagnostic findings.
- Depending on the source, billing data may have been up-coded (Humana).
- Outcome data may be lacking: absence of return-visit data may simply mean that the patient failed to improve and went elsewhere.

Special Challenges (2)
- Data fragmentation, especially where healthcare is provided by separate institutions.
- Data is observational: treatments and exposures are not assigned randomly.
  - Confounding bias: socioeconomic factors might lead patients to use suboptimal treatments.
  - Selection/sampling bias: atypical demographic attributes of the cohort whose data you are seeing may limit the inferences that you can make about the general population.

Frontiers: Genetic Data
- There are no technical barriers to incorporating limited genetic data for an individual (e.g., SNPs or specific mutations) in structured (i.e., readily analyzable) form.
- The major current issue is EMR vendors' limited understanding of genetic data and definitions.
- Whole-genome data is still a long way off. A single record would be larger than the bulk of existing non-image EMR data.

Conclusions
- None of the challenges is insurmountable, but they take a lot of effort and resources to address.
- Most of the fixes are long-term, involving:
  - Manual mapping to controlled vocabulary terms
  - Change in processes
  - Maintaining descriptive documentation that must continually be checked for usability and currency.

Further Reading
- Masys DR, et al. Technical desiderata for the integration of genomic data into Electronic Health Records. J Biomed Inform. 2012 Jun;45(3):419-22.
- Nadkarni, Ohno-Machado and Chapman. Natural Language Processing: A Tutorial. Journal of the American Medical Informatics Association, 2011. PMC3168328.
- Hoffman & Podgurski. "Big, bad data". Journal of Law, Medicine and Ethics (2013) 41:1, pp. 56-60. http://www.ncvhs.hhs.gov/130430b6.pdf

Questions?