Using Electronic Medical Records Systems for Clinical Research: Benefits and Challenges
Prakash M. Nadkarni

Introduction
• Opportunities
    • Availability of clinical, financial and administrative data in electronic form
• Challenges
    • Using EMR software for research operations
    • Using EMR data for research?
        • Suitability of care-oriented data to clinical research needs
        • EMRs queried directly to answer research questions

EMR/Clinical Research Information System (CRIS) Differences: Research Subjects
• Subjects are not necessarily "patients".
• Personal Health Information (PHI) may be optional.
• Not all screened subjects are enrolled.
• Simultaneous or sequential enrollment
• Eligibility criteria

EMR/CRIS Differences: The Study Calendar
• Events/visits and the study calendar: specific evaluations or interventions are done at specific time points ("events") relative to the start of the study.
• Not all patients are enrolled at the same time.

EMR/CRIS Differences: Electronic Data Capture (EDC)
• CRIS EDC is far more structured and fine-grained; textual comments are only a last resort.
• CRISs may need to support real-time self-reporting of subject data.
• CRIS EDC may not always be real-time.
• Quality-control considerations dictate many workflow steps.

EMR/CRIS Differences: Trans-Institutional Scope
• For trans-institutional scope, Web technology is virtually mandated.
• Site restriction in multi-site studies: end users and investigators access only their own site's patients.
• Trans-national issues: software localization/globalization (same software, different language/layout).

EMR/CRIS Differences: User Roles
• CRISs support differential access to studies.
    • Most users of a CRIS are unaware of the other studies in the same database.
    • Some users have read-only access to the data; some only view reports.
    • Only certain users may be allowed to enter data in particular forms, or even view certain "blinded" data.
• Data analysts typically do not need to access PHI.
    • However, in multi-institutional studies, they are not typically site-restricted (see later).

EMR/CRIS Differences: Summary
• EMRs are intended primarily to support patient care, not research.
• CRISs are specifically designed for research protocols.
• EMRs may inter-operate with CRISs.
    • Sub-systems: laboratory, pharmacy, scheduling
• An EMR *may* be used with structured EDC for intra-institutional studies if the only alternative is paper, or if data entry would otherwise be duplicated.
• Claims by any EMR vendor that their systems are CRIS-capable should be viewed skeptically.

EMR Data for Research: The Nature of EMR Data
• Significant dependence on narrative text, which is often the gold standard for clinical findings.
• Using administrative/billing data as a surrogate for clinical data
    • Miscoding, variations in coding

Using EMR Data for Research
• Primarily hypothesis suggestion/generation rather than confirmation
    • Sample size may be too small to achieve statistical significance.
    • Most data mining tests only show association, which does not prove causation.
• Selection of patients matching complex criteria: sample size projections for a planned study (a strength of I2B2: no IRB approval is needed because only anonymized data is returned).

Medical Natural Language Processing 101
• NLP is concerned with extraction of meaningful information from human language input.
• The ultimate goal is to transform unstructured text into a structured form.
• Most NLP applications are targeted toward specific goals, e.g., identification of medications or adverse drug events.
• NLP is not 100% accurate.

Medical NLP 101: Symbolic/Rule-based Approaches
• Linguistic/symbolic NLP approaches employ hand-crafted grammar rules to parse text into units of speech (symbols), which are then processed further.
• Still used successfully for limited problems.
• This approach does not always scale.
    • Labor-intensive, ambiguous parses, poor results with telegraphic text.
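
As a rough illustration of the rule-based approach described above, here is a minimal sketch that applies one hand-crafted rule to find medication mentions. The drug list, the rule pattern, and the function name `extract_medications` are assumptions made for this example only; they do not come from any particular NLP package, and a real system would need a full drug vocabulary and many more rules.

```python
import re

# Hypothetical mini-lexicon for illustration; not a real drug vocabulary.
DRUG_NAMES = ["aspirin", "metformin", "lisinopril"]

# One hand-crafted rule: drug name, then a dose (number + unit),
# optionally followed by a frequency abbreviation.
MEDICATION_RULE = re.compile(
    r"\b(?P<drug>" + "|".join(DRUG_NAMES) + r")\b"
    r"\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)\b"
    r"(?:\s+(?P<freq>daily|bid|tid|qid))?",
    re.IGNORECASE,
)


def extract_medications(text):
    """Return one dict per medication mention matched by the rule."""
    results = []
    for m in MEDICATION_RULE.finditer(text):
        results.append({
            "drug": m.group("drug").lower(),
            "dose": float(m.group("dose")),
            "unit": m.group("unit").lower(),
            # Frequency is optional in the rule, so it may be absent.
            "frequency": (m.group("freq") or "").lower() or None,
        })
    return results


if __name__ == "__main__":
    note = "Pt on Metformin 500 mg bid. Started aspirin 81mg daily. lisinopril held."
    for med in extract_medications(note):
        print(med)
```

The rule finds the two mentions that follow the expected "drug dose unit" shape and silently skips "lisinopril held", which has no dose. Covering such telegraphic phrasing means writing more rules by hand, which is the scaling problem noted above.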

Medical NLP 101: Statistical NLP
• Relies on large bodies of text annotated with the correct answers by humans.
• Utilizes probabilistic methods for prediction.
• The larger and more representative the training data, the better the results will be.
• Approaches include Support Vector Machines (SVMs), Hidden Markov Models (HMMs), and Conditional Random Fields (CRFs).

Medical NLP 101: Sub-problems
• NLP software typically works as a pipeline of modules: modules for low-level tasks precede those for high-level tasks.
• Low-level tasks
    • Segmentation: sentence and word boundary detection, problem-specific boundary detection
    • Part-of-speech tagging
    • Morphological decomposition of compound words
    • Aggregation: identification of phrases

Medical NLP 101: Sub-problems (2)
• High-level tasks
    • Spelling and grammatical error correction
    • Named Entity Recognition, including medical concept recognition
    • Word/abbreviation disambiguation
    • Negation and uncertainty identification
    • Relationship extraction
    • Temporal inferencing

Medical NLP: Practical Issues
• A change of workflow and the introduction of structure can eliminate a difficult problem.
• Code reuse to avoid reinventing wheels.
• General vs. specific solutions
• Tools need commoditization.

Querying EMR Data: Technological Considerations
• A database cannot be simultaneously designed for rapid query and for efficient interactive, multi-user updates.
• EMR database designs are transaction-oriented.
• EMRs are optimized for "patient/entity-centric" queries, not "attribute-centric" queries.

Data Warehousing 101
• Principle: operating on a separate read-only copy of the data on separate hardware yields better query performance.
• Structural tweaks include adding extra and precomputation of aggregate values.
• Special types of indexes (bitmap indexes) yield improved query performance.
• "Star schemas" characterize most warehouse designs.
• Farmers vs. Explorers (Inmon)
• "Virtual" integration ("federation")

Data Warehousing: Practical Considerations
• After a warehouse is built, the need for custom reports may increase rather than decrease.
• The critical requirement for effective ad hoc query is a comprehensive understanding of the data.
• This is generally a full-time effort.

Special Considerations: Querying of Clinical Data
• Both EMRs and large-scale CRISs typically store clinical data in Entity-Attribute-Value (EAV) form.
    • Hundreds of thousands of clinical parameters exist across all medical domains.
    • The vast majority of parameters will be inapplicable for a particular subject/patient.
• EAV is a triple: Entity = patient + point in time, Attribute = parameter, Value = value of that parameter.
• EPIC Flowsheet data uses EAV.

Standardization
• The mere presence of structure does not solve all problems.
• Synonyms in narrative text are unavoidable and must be reduced to the same concept.
• Controlled medical vocabularies (UMLS) help.
• UMLS is not a panacea.
• Institutions will therefore evolve their internal controlled vocabularies.

Standardization Considerations
• Standardizing your definitions
• 2nd Law of Thermodynamics
• Poor definition quality becomes a problem if pooled-data (or meta-) analysis is intended.
• Features of certain systems predispose to disorder. (Learn As You Go, separate definitions databases.)
• Even the best system is not immune: the path of least resistance.
• Consistent definition is difficult to achieve after the fact (Deming).

EMR use as the basis for research hypotheses
• Conflicting evidence regarding EMR benefit still appears.
• A *well designed* EMR may benefit.
• Electronic alerting systems themselves may not improve care, unless EMRs also reduce workload through automatic actions.
• Review vendor-supplied templates carefully.

Conclusions: Future EMR Evolution
• EMRs fully supporting CRIS capability are unlikely to evolve.
• No software should attempt to do everything.
    • Differences in storage-engine capabilities
    • A jack-of-all-trades approach (doing everything in a mediocre manner) is not viable.
    • It is difficult (or impossible) to devise a logically consistent user-interface metaphor that applies to diverse unrelated features. Example: Microsoft Office.

Inter-operation (1)
• Co-existing and inter-operating best-of-breed packages offer the best usability and feature set.
    • CRISs, genomic/proteomic data management packages
• There may be minimal data duplication, e.g., EMRs may pull in very limited summary information on critical genetic data for selected patients, so that it is immediately visible.

Inter-operation (2)
• CRIS/EMR
    • Bulk import of laboratory parameters, to avoid duplicate data entry
    • Automatic grading of laboratory-based adverse events (oncology studies): Richesson et al.
    • Use for scheduling research subject visits
    • Pharmacy subsystem for drug dispensation
    • EMR for primary EDC in intra-institutional studies if the only alternative is paper, or if data entry would otherwise be duplicated.
• EMR/Specialized EMR
• Picture-archiving systems

Inter-operation (3)
• Application Programming Interfaces (APIs)
    • All large packages (CRISs, EMRs, 'omics) require APIs to make inter-operation efficient.
    • APIs are vendor-specific. Inter-operation standards (e.g., the HL7 Virtual Medical Record) have not gained much traction.
    • Currently, many vendors set unreasonable financial and other barriers to use of their APIs (e.g., official certification, withholding of documentation).
    • EMRs lag behind the software industry's trend toward open source.

Questions?

Further reading
• CRIS
    • Richesson and Andrews. Clinical Research Informatics. Springer, 2012.
• NLP
    • Jurafsky and Martin. Natural Language Processing.
    • Manning and Schuetze. Foundations of Statistical Natural Language Processing.
    • Nadkarni, Ohno-Machado and Chapman. Natural Language Processing: An Introduction. Journal of the American Medical Informatics Association, 2011.
• Data Warehousing
    • Larry Greenfield. The Data Warehousing Information Center. www.dwinfocenter.org/
    • Kimball, Reeves, Ross and Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. Wiley, 1998.