Validating EMR Audit Automation
Carl A. Gunter, University of Illinois
Accountable Systems Workshop

Root Problem Statement
Situation
• Access to hospital Electronic Medical Record (EMR) data carries a risk of high loss in the event of false negatives (incorrect refusal of access).
  – Example: a doctor acting in an emergency cannot get access to a patient's list of allergies.
• Hospitals have highly trained personnel in whom much trust is vested.
Consequences
• Hospital access systems give liberal access to records, relying on accountability.
• Insider threats are serious and abuses are widely documented.
• Accesses are too numerous for experts to review manually.
• Automated support is required.

Validation Problem Statement
Ideal Approach
• Obvious approach: develop an anomaly detector (AD) with rules and train classifiers on bad and good accesses.
• Run the AD on the audit logs and investigate positives manually with domain experts.
Problem
• This requires considerable dependence on experts.
• It assumes experts know how to provide labels.
• It assumes experts can formulate rules.
• It assumes labeled training sets exist and that researchers will be able to get access to them.

Primary Validation Approach
• The primary validation approach applied by researchers in this area can be called the Random Object Access Model (ROAM).
• ROAM is based on the premise that anomalous users and accesses look random.
• Strategy
  – Develop rules and train a classifier on a real data set augmented with synthetic random users and accesses.
  – Test the ability to recognize the random users or accesses.

ROAM Assessment
Pro
• It is likely that illegitimate accesses appear random.
• A good ROAM classifier prepares for expert review to identify false positives.
• A ROAM classifier may find legitimate but interesting hospital information flows.
• ROAM provides a ready testing strategy reminiscent of "fuzzing".
Con
• There is currently no quantified evidence that random accesses and illegitimate accesses strongly overlap.
• Indeed, there is evidence that in some cases legitimate accesses look random.
• Some illegitimate accesses may be systematic in ways that defy detection by ROAM classifiers.

Beyond ROAMing
• What are the prospects for alternative models?
• Example: introduce specific attacks experienced "in the wild", similar to network traces enriched with known attacks.
• Another idea: look at problems like masquerading and open terminals.
• These behaviors are not random, but they may display learnable characteristics.

Random Topic Access Model (RTAM)
We explored an alternative validation model based on topic classification.
Idea:
• Patients are "documents" and their diagnoses, drugs, etc. are their "words".
• Use Latent Dirichlet Allocation (LDA) to learn topics that can be used to classify patients (a minimal LDA sketch follows the figures below).
• Use this to characterize users as readers of documents.
• Detect unusual readers.
• Detect readers of random topics.
Reference: Modeling and Detecting Anomalous Topic Access, Siddharth Gupta, Casey Hanson, Carl A. Gunter, Mario Frank, David Liebovitz, and Bradley Malin. IEEE Intelligence and Security Informatics, June 2013.

Topic Distributions
[Figure: example diagnosis topic distributions, including neoplasm, obstetric, and kidney topics.]

Multidimensional Scaling: Patient Diagnosis
[Figure: multidimensional scaling of patients by diagnosis.]
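As an aside, the following is a minimal sketch of the "patients as documents" idea above, using scikit-learn's LDA. The diagnosis codes, patient records, topic count, and the averaging used to type a user are invented for illustration; this is not the paper's implementation.

```python
# Minimal sketch of the RTAM "patients as documents" idea using LDA.
# The diagnosis codes, patient records, and topic count below are invented
# for illustration; they are not from the study's data set.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each patient is a "document" whose "words" are diagnosis/drug codes.
patients = [
    "neoplasm chemotherapy neoplasm biopsy",
    "pregnancy delivery obstetric_ultrasound",
    "renal_failure dialysis hypertension",
    "neoplasm radiation biopsy",
    "pregnancy prenatal_visit delivery",
]

# Bag-of-words representation of the patients.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(patients)

# Learn a small number of diagnosis "topics" (e.g., neoplasm, obstetric, kidney).
lda = LatentDirichletAllocation(n_components=3, random_state=0)
patient_topics = lda.fit_transform(X)   # one topic distribution per patient

# A user who accesses a set of patients can then be "typed" as the mix of the
# topic distributions of the patients they read (an illustrative choice).
accessed = patient_topics[[0, 3]]       # e.g., a user who read patients 0 and 3
user_type = accessed.mean(axis=0)
print("User topic profile:", np.round(user_type, 3))
```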
RTAM: Random Users
• A random user is drawn as r ~ Dir(α) with n dimensions, where n is the number of topics.
  a.) Direct or Masquerading User (α < 1): an anomalous user of some specialty gains sole access to the terminal of another user in the hospital.
  b.) Purely Random User (α = 1): the user is characterized by completely random behavior, with little semantic congruence to the hospital setting.
  c.) Indirect User (α > 1): the user resembles an even blend of the topics of many specialized users.

Random Topic Access Detection (RTAD)
• Random Topic Access Detection (RTAD): an anomaly detection framework that generates synthetic users using RTA and applies a standard spatial-outlier, k-nearest neighbor (k-NN) detection scheme for classification.
• Methodology
  1. LDA: define patient topics and a user typing to represent users in the topic space.
  2. RTA user injection: generate the three types of anomalous users and insert them into each role at a 5% mix rate.
  3. Detection (k-NN): if the ratio of the average distance from a user to its k nearest spatial neighbors to the average pairwise distance among those neighbors exceeds a threshold, flag the user as anomalous (see the sketch after the conclusions).
  4. Evaluation metric: best Area Under the Curve (AUC) for each (α, role) combination.

Results - I
The best AUC across all evaluated dimensions is plotted for each role that performs poorly for α > 1.

Results - II
The best AUC across all evaluated dimensions is plotted for each role that performs well or near average for α > 1.

Discussion and Conclusions
• Strategies other than ROAM may capture new types of threats.
• There has been good progress on technical measures of validation; links to expert review and ground truth are still needed.
• More evaluation studies are needed.
• It is important to integrate access audit with general business intelligence: understanding the roles and workflows of the organization.
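Sketch referenced in the RTAD methodology above: a minimal, illustrative implementation of steps 2 and 3, assuming Euclidean distance in topic space and illustrative values for the population size, α, k, and the threshold (the paper selects these per role and α via AUC). This is not the authors' code.

```python
# Sketch of RTAD steps 2 and 3: inject Dirichlet-random users into a role's
# topic-space population and score each user with a k-NN distance ratio.
# The population size, alpha, k, and threshold here are illustrative only.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n_topics = 10

# Step 2 (RTA user injection): real users of a role would come from LDA typing;
# here we fake a tight cluster and add ~5% synthetic Dirichlet users.
real_users = rng.dirichlet(alpha=np.full(n_topics, 50.0), size=200)
alpha = 0.5                                  # alpha < 1: "masquerading" regime
synthetic = rng.dirichlet(alpha=np.full(n_topics, alpha), size=10)
users = np.vstack([real_users, synthetic])

# Step 3 (Detection): k-NN spatial-outlier score, i.e., the ratio of the average
# distance from a user to its k nearest neighbors over the average pairwise
# distance among those neighbors.
def knn_ratio_score(users, k=5):
    d = cdist(users, users)                  # pairwise Euclidean distances
    scores = []
    for i in range(len(users)):
        nn = np.argsort(d[i])[1:k + 1]       # k nearest neighbors (skip self)
        avg_to_nn = d[i, nn].mean()          # avg distance from user i to neighbors
        pairwise = d[np.ix_(nn, nn)]         # distances among those neighbors
        avg_among_nn = pairwise[np.triu_indices(k, 1)].mean()
        scores.append(avg_to_nn / avg_among_nn)
    return np.array(scores)

scores = knn_ratio_score(users)
threshold = 2.0                              # illustrative; tuned per role/alpha via AUC
flagged = np.where(scores > threshold)[0]
print("Flagged user indices:", flagged)
```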