Decision support from local data: Creating adaptive order
menus from past clinician behavior
Klann, Jeffrey G., Peter Szolovits, Stephen M. Downs, and
Gunther Schadow. “Decision Support from Local Data: Creating
Adaptive Order Menus from Past Clinician Behavior.” Journal of
Biomedical Informatics 48 (April 2014): 84–93.
Creative Commons Attribution-NonCommercial-NoDerivs
NIH Public Access
Author Manuscript
J Biomed Inform. Author manuscript; available in PMC 2015 April 01.
Published in final edited form as:
J Biomed Inform. 2014 April ; 48: 84–93. doi:10.1016/j.jbi.2013.12.005.
Decision Support from Local Data: Creating Adaptive Order
Menus from Past Clinician Behavior
Jeffrey G. Klann, PhD*,a,b,e,1, Peter Szolovits, PhDc, Stephen Downs, MD, MSd,e, and
Gunther Schadow, MD, PhDe,2
a Laboratory of Computer Science, Massachusetts General Hospital, One Constitution Center,
Suite 200, Boston, MA 02129
b Harvard Medical School, 25 Shattuck St, Boston, MA 02115
c Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology,
Stata Center, 32 Vassar St, 32-254, Cambridge, MA 02139
d Children's Health Services Research, Indiana University School of Medicine, 410 W. 10th Street,
Suite 1000, Indianapolis, IN 46202
e The Regenstrief Institute for Health Care, 410 W. 10th Street, Suite 2000, Indianapolis, IN 46202
Objective—Reducing care variability through guidelines has significantly benefited patients.
Nonetheless, guideline-based clinical decision support (CDS) systems are not widely implemented
or used, are frequently out-of-date, and cannot address complex care for which guidelines do not
exist. Here, we develop and evaluate a complementary approach: using Bayesian network (BN)
learning to generate adaptive, context-specific treatment menus based on local order-entry data.
These menus can be used as a draft for expert review, in order to minimize development time for
local decision support content. This is in keeping with the vision outlined in the US Health
Information Technology Strategic Plan, which describes a healthcare system that learns from
© 2013 Elsevier Inc. All rights reserved.
Corresponding Author: [email protected], ph: 617-643-5879, fax: 617-643-5280.
1Dr. Klann is no longer affiliated with The Regenstrief Institute for Health Care (affiliation e)
2Present Address: Pragmatic Data LLC, 8839 Rexford Rd., Indianapolis, IN 46260
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our
customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of
the resulting proof before it is published in its final citable form. Please note that during the production process errors may be
discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Dr. Klann designed and implemented the study and wrote the manuscript.
The other authors served as advisors, helping to conceptually devise portions of the study, revise the methodology and implementation
strategy, and provide feedback on the study design. The authors each offered particular expertise: Dr. Szolovits in machine learning
approaches on clinical data; Dr. Downs in decision modeling and Bayesian networks, and Dr. Schadow in clinical data mining and
data analysis.
All authors also edited, contributed to, and approved the manuscript.
None declared.
Materials and Methods—We used the Greedy Equivalence Search algorithm to learn four 50-node domain-specific BNs from 11,344 encounters: back pain in the emergency department,
inpatient pregnancy, hypertension in the urgent visit clinic, and altered mental state in the
intensive care unit. We developed a system to produce situation-specific, rank-ordered treatment
menus from these networks. We evaluated this system with a hospital-simulation methodology
and computed Area Under the Receiver-Operator Curve (AUC) and average menu position at time
of selection. We also compared this system with a similar association-rule-mining approach.
Results—A short order menu on average contained the next order (weighted average length
3.91–5.83 items). Overall predictive ability was good: average AUC above 0.9 for 25% of order
types and overall average AUC .741–.844 (depending on domain). However, AUC had high
variance (.50–.99). Higher AUC correlated with tighter clusters and more connections in the
graphs, indicating importance of appropriate contextual data. Comparison with an association rule
mining approach showed similar performance for only the most common orders with dramatic
divergence as orders are less frequent.
Discussion and Conclusion—This study demonstrates that local clinical knowledge can be
extracted from treatment data for decision support. This approach is appealing because it reflects
local standards; it uses data already being captured; and it produces human-readable treatment-diagnosis networks that could be curated by a human expert to reduce workload in developing
localized CDS content. The BN methodology captured transitive associations and co-varying
relationships, which existing approaches do not. It also performs better as orders become less
frequent and require more context. This system is a step forward in harnessing local, empirical
data to enhance decision support.
clinical decision support; data mining; Bayesian Analysis
A currently popular approach to improving the quality of health care is to make sure that
similar cases are handled in similar ways, i.e., to reduce the variability of care. [1]
Frequently this is accomplished through propagation of external protocols into practice,
through mechanisms such as Clinical Decision Support (CDS). [2]
Unfortunately, computable CDS content is extremely expensive and time-consuming to
create [3], maintain [4], and localize [5]. Consequently CDS has been much more slowly
adopted than other components of Health Information Technology (HIT) [6]. Even when
CDS is available, the content is frequently inappropriate or incorrect. [7] Various projects are
being undertaken to standardize computable CDS content in order to reduce the local
implementer’s work (e.g., [8]).
BN: Bayesian Network
ARM: Association Rule Mining
CPT: Conditional Probability Table
GES: Greedy Equivalence Search
ITS: Iterative Treatment Suggestion (the methodology defined in this manuscript)
UVC: Urgent Visit Clinic
Still, standardized CDS does not address the following issues: the frequency of content
change in medicine, physician attitudes toward guidelines, and terminology challenges.
First, much content, both routine and complex, is not distilled into guidelines. [9] This might
be quite common; in one study, the literature provided answers to primary care providers’
routine clinical questions only 56% of the time. [10] Second, studies have shown that
physicians value colleagues' advice at least as much as guidelines. [11] This might be
because medicine is locally situated, and colleagues can provide a local frame of reference
through which to decide if and how external guidelines relate to particular local cases. [12]
Third, standardized content databases require translation of codes into standard
terminologies, which is difficult and frequently causes failures in interoperability.
Electronic Medical Record (EMR) data are rapidly proliferating [13], in part due to the
Meaningful Use incentive program. [14] These data offer the opportunity to harness local
physician wisdom (how care is actually delivered) to augment and suggest protocols, vastly
decreasing human effort in developing CDS content and making knowledge available in
complex scenarios. It is possible to partially reconstruct physician decisions by aggregating
the millions of treatment events in medical record systems. Such locally generated CDS
content avoids the three issues discussed above. This fits into the Office of the National
Coordinator for HIT’s strategic plan, which centers on building a “learning healthcare
system” that can perform dynamic analysis of existing healthcare data to glean various
information, including best practices. [15]
1.1. The Wisdom of the Crowd
Despite the incompleteness of guidelines and poor maintenance of expert-curated CDS,
individual physician behavior is not reliable either. Studies show that care continues to be
widely variable and that physicians’ treatment does not align well with guidelines. [16]
Therefore we suggest two important goals in the design of a CDS tool based on local
First, the average behavior of many physicians is usually much better than any individual
physician. Condorcet’s jury theorem, upon which voting theory is grounded, proves that
when each member in a group of independent decision makers is more than 50% likely to
make the correct decision, averaging those decisions ultimately leads to the right answer.
[17] If we believe that a physician is more likely than chance to make the correct decision,
we can trust the averaged decision. The theorem does have two important caveats. First, it is
only guaranteed to apply to binary choices (plus an unlimited number of irrelevant
alternatives). [18] Thankfully, many high-level medical decisions are of this type (e.g., “do I
anticoagulate this patient or not?”). Second, crowd wisdom can become crowd madness
when decision-makers are not truly independent but are influenced by some outside entity.
[19] And of course, practitioners are influenced by colleagues, formularies, available
equipment, local culture, etc. The Dartmouth Atlas project has found that the quality of care
in a region is profoundly influenced by the ‘ecology’ of healthcare in that region, including
resources and capacity, social norms, and the payment environment. [20]
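The effect described by Condorcet's jury theorem can be illustrated numerically. The following is a minimal sketch, assuming independent voters who each make the correct binary choice with the same probability; the function name and the example values are hypothetical and are not taken from the study.

```python
from math import comb


def majority_correct_probability(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes every voter is correct with the same probability p_correct
    and that n_voters is odd, so ties cannot occur.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest strict majority
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


if __name__ == "__main__":
    # With p_correct above 0.5 the majority improves as the group grows;
    # below 0.5 it gets worse ("crowd madness").
    for n in (1, 11, 101):
        print(n, majority_correct_probability(n, 0.6), majority_correct_probability(n, 0.4))
```

The sketch covers only the binary, fully independent case; it says nothing about the correlated decision-makers discussed in the second caveat.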
This leads to our second design goal. Even when averaging decisions, it is
impossible to guarantee that results are not influenced by these caveats. Therefore we do not
seek to replace manual content development with automatically generated CDS content.
Instead, our goal is to complement content development with knowledge distilled from EMR
data. To this end, it was important to choose a data mining approach which produces output
that a human expert could understand and update before inserting it into a clinical system.
1.2. Mining EMR Data
A handful of studies have explored methods to abstract treatment decisions captured in EMR
data into knowledge bases [21–25] or to find knowledge on-demand [26]. The majority of
work in abstracting EMR data has used variations of the pairwise association-rule mining (ARM) algorithm [27], which has shown good results when capturing global
linkages where little variability exists (e.g., drugs used for HIV treatment). [28] However,
researchers have struggled with both transitive associations and the long, static lists of
associations that do not take context into account. In one case, the results of such an
approach required a great deal of manual editing before incorporation into a decision support
system. [29] Other studies have used this approach only as a rudimentary starting point for
content developers. For example, the condition-treatment linkages in RxNorm were
‘jumpstarted’ by this approach. [24]
Bayesian Networks (BNs) are an appealing alternative for mining wisdom from EMR data.
BNs are a powerful multivariate, probabilistic reasoning paradigm that naturally models
interactions among associations. BNs have a two-phase lifecycle. First, they are constructed,
either by hand, which has been widespread in medical informatics research (see, e.g., [30]), or, more recently, from databases of observational data. [31] Such ‘structure learning
algorithms’, as they are called, take into account transitive associations and co-varying
relationships that pairwise rule mining cannot. Therefore, BN structure learning might be
able to make sense out of the tangled correlations in clinical data that have hampered other
approaches. The second phase of the BN lifecycle is its use: rather than being static networks
or rules, BNs enable rapid, iterative exploration of decisions as context evolves.
In a previous study, we piloted a BN approach to produce static order menus for
complications of inpatient pregnancy. [32] Our results were very promising, but our
scenarios were fixed, they only explored one small domain of medicine, and they relied on
the opinion of a single nurse practitioner to evaluate our results. In this study, we first extend
our previous work, using BNs to learn the typical successions of orders made by
clinicians for a variety of types of cases. Next, we build a recommendation system that
responds adaptively to suggest the most common next orders based on what has been
ordered and diagnosed previously. Third, we evaluate this system on hospitalization order-entry data in a multitude of scenarios across four domains. Finally, we undertake a brief
comparison of this dynamic approach to a static ARM-like approach.
1.3. Objective
Our goal was to develop a methodology to produce adaptive, patient-tailored, situation-specific treatment advice from order-entry data, which can be used as a draft for expert
review, in order to minimize development time for local decision support content. We used
Bayesian Networks because of their adaptive nature and their ability to account for transitive
associations and co-varying relationships. Also, they are human-readable and could
therefore be curated by a human expert. We built and evaluated a recommendation system
that dynamically suggests the most common next order based on what has been ordered
previously. We also compared it to a static ARM-like approach.
2.1. Bayesian Networks and Induction from Data
A BN is a directed acyclic graph of vertices (nodes) and edges connecting those vertices. Embedded
in each node is a conditional probability table (CPT), which specifies the probability of each
node state given the states of its parents. In this work, we learn BNs that represent the
probabilistic relationships among orders and diagnoses. Then, as specific orders are placed
and diagnoses made in a specific case, we instantiate the variables corresponding to those
actions in the network (known as evidence), which revises the probabilities for other orders
in the BN to the posterior probability that they would be placed conditioned on the previous
actions. This allows us to rank remaining orders by their probability of occurring. In our
interface, we present these to the user as orders are placed, in descending order of
probability. We do not present diagnoses on the order menus, because the goal is to suggest
treatments, leaving diagnosis to clinicians. An example of a simple BN, the underlying
probabilistic relationships, and the revised posterior probabilities given evidence is shown in
Figure 1. The methodology, Iterative Treatment Suggestion (ITS), is summarized in Table 1.
We implemented this methodology in Java using the SMILE toolkit [33], a freely available
toolkit for network inference. A prototype of this interface can be seen in Figure 2.
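The evidence-and-rerank step can be sketched on a toy network. The example below is an illustration only: it uses brute-force enumeration in Python rather than the SMILE toolkit, and the three nodes and every probability in the tables are invented, not learned from the study data.

```python
from itertools import product

# Hypothetical three-node network: one diagnosis with two child orders.
# All names and probabilities are invented for illustration only.
PARENTS = {
    "postpartum": (),
    "simethicone": ("postpartum",),
    "cold_pack": ("postpartum",),
}
# CPT[node][parent_states] = P(node is True | parent states)
CPT = {
    "postpartum": {(): 0.30},
    "simethicone": {(True,): 0.60, (False,): 0.05},
    "cold_pack": {(True,): 0.70, (False,): 0.10},
}
ORDERS = ("simethicone", "cold_pack")  # diagnoses are never shown on the menu


def joint(assignment):
    """Joint probability of one full True/False assignment to every node."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def posterior(target, evidence):
    """P(target = True | evidence), by enumerating all assignments."""
    nodes = list(PARENTS)
    num = den = 0.0
    for states in product((True, False), repeat=len(nodes)):
        a = dict(zip(nodes, states))
        if any(a[n] != v for n, v in evidence.items()):
            continue  # inconsistent with the evidence entered so far
        p = joint(a)
        den += p
        if a[target]:
            num += p
    return num / den


def menu(evidence):
    """Orders not yet placed, ranked by descending posterior probability."""
    remaining = [o for o in ORDERS if o not in evidence]
    return sorted(((o, posterior(o, evidence)) for o in remaining), key=lambda t: -t[1])
```

Calling `menu({})` ranks orders by their prior probabilities, and calling `menu({"simethicone": True})` shows how entering one order as evidence revises the probability of the other. Enumeration is exponential in the number of nodes, so a 50-node network would need a proper inference engine.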
2.2. Inducing Bayesian Networks from Data
A common approach to induce a Bayesian network from data (called structure learning) is a
greedy search-and-score methodology. From a set of disconnected nodes, edges are added,
removed, and reversed until a network is found that best explains a training dataset
according to a scoring function. Here we used the BDeu scoring function. [34] A greedy
search is used because the number of possible graphs grows super-exponentially with the number of nodes, so
an exhaustive search is not feasible for networks of more than a few nodes. [35]
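To make the size of the search space concrete, the number of labeled directed acyclic graphs on n nodes can be computed with Robinson's recurrence. This is a minimal sketch added for illustration; the function name is hypothetical and the calculation is not part of the study's method.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled directed acyclic graphs on n nodes.

    Uses Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k), a(0) = 1.
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    for n in (3, 5, 10, 50):
        print(n, len(str(count_dags(n))), "digits")
```

Five nodes already admit 29,281 candidate structures, which is why search-and-score methods explore the space greedily instead of exhaustively.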
The most powerful greedy search is arguably the Greedy Equivalence Search (GES). [36]
Rather than searching Bayesian networks, it searches what are known as ‘equivalence
classes’ of Bayesian networks. These are groups of Bayesian networks that all are
probabilistically equivalent. In the large-sample limit, if an optimal Bayesian network exists for the given dataset,
GES is guaranteed to find its equivalence class. Therefore, we used a GES implementation in the freely available
Tetrad toolkit. [37]
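The idea of probabilistically equivalent networks can be seen in the smallest possible case: a two-node network with the edge pointing either way represents the same joint distribution. The sketch below assumes an invented joint table over two binary variables A and B; it is an illustration of the concept, not of the GES algorithm itself.

```python
# Hypothetical joint distribution over (A, B); numbers invented for illustration.
JOINT = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.15,
    (False, False): 0.55,
}


def factorize(joint, first):
    """Split a joint into P(first) and P(other | first).

    `first` is the tuple index of the parent: 0 for A -> B, 1 for B -> A.
    """
    marginal = {
        v: sum(p for k, p in joint.items() if k[first] == v) for v in (True, False)
    }
    conditional = {k: p / marginal[k[first]] for k, p in joint.items()}
    return marginal, conditional


def rebuild(marginal, conditional, first):
    """Multiply the two factors back into a joint table."""
    return {k: marginal[k[first]] * c for k, c in conditional.items()}


a_to_b = rebuild(*factorize(JOINT, 0), 0)  # network A -> B
b_to_a = rebuild(*factorize(JOINT, 1), 1)  # network B -> A
```

Both factorizations recover the original table, so no dataset can distinguish the two edge directions; searching over equivalence classes avoids scoring such duplicates separately.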
2.3. Hospital Simulation Methodology
To evaluate ITS in the myriad of evolving clinical situations, we chose to compare how well
the suggestion menus predict the actual next action taken in a hospitalization. Therefore we
wrote a program to simulate hospitalizations on our test set using the ITS methodology. As
in ITS (Table 1), our program places each order in the hospitalization in succession, adding
it to the ‘evidence’ in the network, and recalculating the posterior probabilities for variables
in the network. It also adds diagnoses as evidence at the appropriate time step in the
hospitalization. After each order in the hospitalization, our program records the posterior
probabilities in the menu (step 2), in order to calculate performance in predicting the next
order. To determine order succession within each hospitalization, we used the time and
session information in our order-entry data. Where two orders had the same recorded time,
we used both possible orderings and kept the higher-scoring combination. In the event an
order was placed more than once, subsequent placements were ignored (because our system
allows orders to be entered as evidence only once).
Using the recorded posterior probabilities and the actual next order placed, we were able to
compute the Area Under the Receiver-Operator Curve (AUC). This measures
discriminability, equivalent to the probability that when an order is placed, it will be ranked
higher than at previous times. We used the approach in Hanley and McNeil [38] to calculate
the AUC directly without first calculating the full ROC curve. The formula is as follows:
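$$\mathrm{AUC} = \frac{1}{|\vec{T}|\,|\vec{F}|} \sum_{t \in \vec{T}} \sum_{f \in \vec{F}} c(t, f), \qquad c(t, f) = \begin{cases} 1 & \text{if } t > f \\ 0.5 & \text{if } t = f \\ 0 & \text{if } t < f \end{cases}$$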
Here T⃗ is a list of posterior probabilities for true instances of a particular order, and F⃗ the
corresponding list for false instances.
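A direct AUC calculation from the two lists can be sketched as follows. This assumes the standard pairwise-comparison (Mann-Whitney) form that Hanley and McNeil relate to the AUC, with ties counted as one half; the function name is hypothetical and the tie handling is an assumption rather than a detail confirmed by the text.

```python
def auc(true_scores, false_scores):
    """Fraction of (true, false) pairs in which the true instance scores higher.

    Ties contribute one half (an assumption). The double loop is quadratic,
    which is adequate for a sketch but slow for very long lists.
    """
    if not true_scores or not false_scores:
        raise ValueError("both lists must be non-empty")
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))
```

For example, `auc([0.8, 0.3], [0.5, 0.1])` compares four pairs, three of which favor the true instance.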
We also computed the average position an order appears in the menu at the time it is
selected. This measures accuracy by reporting the average list length required for 100%
precision. The value is between one and the total number of orders in the network, where
one is the top of the menu (and is therefore the best outcome).
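The menu-position measure can be sketched in a few lines. The function names and the dictionary-based menu representation are hypothetical; ties in probability are broken by insertion order here, which is an assumption.

```python
def menu_position(posteriors, selected):
    """1-based rank of `selected` when orders are sorted by descending probability.

    `posteriors` maps each order still on the menu to its posterior probability.
    """
    ranked = sorted(posteriors, key=posteriors.get, reverse=True)
    return ranked.index(selected) + 1


def average_menu_position(events):
    """Mean position over (posteriors, selected_order) pairs."""
    positions = [menu_position(p, s) for p, s in events]
    return sum(positions) / len(positions)
```

A position of one means the order actually placed was at the top of the menu at the moment it was selected.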
2.4. Comparison with Association Rule Mining
To compare our approach to pairwise association rule mining (ARM), we developed a
variant of the ITS hospital simulation methodology. It performs the same analysis of average
menu position but it uses a static menu of orders, which are arranged in descending
frequency of co-occurrence with the main diagnosis in each domain (e.g., pregnancy in
inpatient pregnancy). To facilitate direct comparison, the orders selected by GES were used
to generate the menu in each domain.
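The static comparison menu can be sketched as a simple co-occurrence count. The encounter layout (dictionaries with `diagnoses` and `orders` keys) and the function name are hypothetical; each order is counted at most once per encounter, which is an assumption.

```python
from collections import Counter


def static_menu(encounters, main_diagnosis, candidate_orders):
    """Rank candidate orders by how often they co-occur with the main diagnosis.

    The resulting list does not change as orders are placed, unlike the
    BN-based menu.
    """
    counts = Counter()
    for enc in encounters:
        if main_diagnosis in enc["diagnoses"]:
            # Count each candidate order once per matching encounter.
            counts.update(set(enc["orders"]) & set(candidate_orders))
    return sorted(candidate_orders, key=lambda o: -counts[o])
```

Passing the same candidate orders that the BN retained keeps the two menus directly comparable.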
2.5. Evaluation
2.5.1. Data Source—For evaluation, we chose four modalities of medicine: inpatient
medicine, the emergency department (ED), the urgent visit clinic (UVC), and the intensive
care unit (ICU). Each modality reflects a different aspect of medicine. Inpatient care focuses
more on treatment than diagnosis in a longer-term stay, the ED involves a shorter stay
involving both diagnosis and treatment, the UVC involves a very brief ‘stay’ focused on
diagnosis, and the ICU involves tightly-correlated actions for very specific care.
We extracted data for four domain-specific BNs from the four selected modalities as follows:
Choosing chief diagnosis: We focused our domains on the most frequent diagnosis/
complaint for the four modalities: visits involving pregnancy in inpatient medicine,
back pain in the ED, hypertension in the UVC, and ‘altered mental state’ in the
Medical ICU (MICU).
Data extraction: We extracted and de-identified 3 years of inpatient order-entry
data from the local county hospital in Indianapolis (2007–2009) and chose visits
that corresponded with each domain. This involved 9228 ED back pain, 1821 UVC
hypertension, 4843 inpatient pregnancy, and 1546 ‘altered mental state’ MICU
Variable selection: For each domain, we selected 50 variables: the 40 most
frequent orders and the 10 most frequent co-occurring diagnoses and complaints.
Orders were of low granularity, which ensured sufficient data for predictive power;
for example, medication orders only included the type of medicine (e.g.,
vancomycin), not the route, dose, or frequency. The diagnoses and complaints
used in our networks can be seen in Table 2. Note that sometimes fewer than ten are
shown because fewer than ten diagnoses/complaints co-occurred with the
Train/test split: We split each data set into a training (2/3 of admissions) and test
set (1/3).
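The variable-selection and splitting steps above can be sketched as follows. The encounter layout and function names are hypothetical, and the random shuffle with a fixed seed is an assumption: the text states the 2/3 to 1/3 proportions but not how admissions were assigned to each set.

```python
import random
from collections import Counter


def select_variables(encounters, n_orders=40, n_diagnoses=10):
    """Most frequent orders and diagnoses, counted once per encounter."""
    order_counts = Counter(o for e in encounters for o in set(e["orders"]))
    dx_counts = Counter(d for e in encounters for d in set(e["diagnoses"]))
    orders = [o for o, _ in order_counts.most_common(n_orders)]
    diagnoses = [d for d, _ in dx_counts.most_common(n_diagnoses)]
    return orders, diagnoses


def train_test_split(encounters, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the encounters and cut it into training and test sets."""
    shuffled = list(encounters)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

If fewer than ten diagnoses occur in a domain, `most_common` simply returns all of them, which matches the shorter lists noted for some domains.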
2.5.2. Computational Approach—Using these four data sets, we applied and evaluated
the BN and ARM methods as follows:
1. Network induction. Via GES (Section 2.2), we induced four Bayesian networks,
one from each of the four training sets. Because GES will discard nodes that do not
have predictive power, sometimes the resulting networks contained fewer than 50
nodes. This was most notable in the ICU network, where only 25 orders were
retained.
2. Hospitalization simulation. We ran our ITS hospital-simulation program (Section
2.3) on each of the four networks using its corresponding test set, which
collected statistics on AUC and average position in the menu at time of selection.
3. Visualization. We wrote a program to export the networks into Gephi format.
Gephi is an open-source network visualization tool. [39] We wrote a Gephi script
to select the Markov Blankets for a set of nodes. A Markov Blanket of a node is its
parents, its children, and its children's other parents, and is frequently used as a heuristic for the set of
most relevant variables in prediction. [40] This allowed us to visually examine
nodes in a graph and their most important neighbors.
4. Comparison to association rule mining. We ran our ARM-based hospital-simulation (Section 2.4), which collected statistics on average position in a static
menu at time of selection.
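Selecting a Markov blanket from an edge list can be sketched in Python as below. This is an illustrative stand-in, not the Gephi script used in the study; the function name and the `(parent, child)` edge representation are hypothetical.

```python
def markov_blanket(edges, node):
    """Parents, children, and the children's other parents of `node`.

    `edges` is a list of (parent, child) pairs describing a directed graph.
    """
    parents = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    co_parents = {p for p, c in edges if c in children and p != node}
    return parents | children | co_parents
```

For the edges `a -> x`, `x -> c`, `b -> c`, and `c -> d`, the blanket of `x` is `{a, b, c}`; the grandchild `d` is excluded.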
A standard desktop computer induced each network (step 1) in less than 30 minutes and ran
the ITS hospital-simulation program (step 2) in an average of 5 minutes. Table 3 shows
summary statistics: average AUC and average menu position, weighted by the frequency of
each order. Figure 3 shows trendlines of the average position vs. order rank by frequency.
For each domain, the 10 orders in which the system performed best and worst (by AUC) are
shown in Table 4.
Figures 4–6 show portions of the graph structure (step 3). Figure 4 shows the Markov
Blankets around some nodes in the pregnancy network with high AUC. Figure 5 shows
nodes with high AUC and their parents and children in the other three networks. Figure 6
does the same with nodes of low AUC. Note that arrow directions should not be interpreted
as showing causality, only a statistical association.
Finally, Table 5 and Figure 7 compare the BN approach (step 2) to an ARM approach (step
4). Table 5 shows the weighted and unweighted average difference in list length between
ARM and BN. Figure 7 shows average menu position vs. order rank by frequency using
ARM. It is directly comparable to Figure 3 for the BN approach.
3.1. Analysis of BN Approach
The evaluation of our treatment suggestion system on four domain-specific BNs against test
cases drawn from the same environments showed fairly strong overall performance. In
particular, our treatment suggestion menus correctly suggest common orders in a short list:
3.91–5.83 items (Table 3). A length of five accurately suggests more than the top 20
inpatient pregnancy orders and emergency department back pain orders (Figure 3). Also, the
system’s average AUC is high (74%–84%, also in Table 3), meaning that common orders
are ranked higher at the time they are ordered than prior to ordering.
There was high variance in performance on individual orders (AUC 0.5–0.99), both across
and within domains (Tables 3 and 4). Within a domain, some orders are suggested almost
exactly when they should be, such as a cold pack in pregnancy visits and a pelvis CT in the
ED. Other orders appear at the bottom of long menus and are not predicted much better than
chance, such as a neurology consult in the ED. Performance varied across domains as well.
Inpatient pregnancy had a weighted average AUC .844 and menu position 3.91 (Table 3),
and even the least frequent orders required a menu length of only half the total orders
(Figure 3). In the other domains, average AUC and menu length were notably worse and the
least frequent orders required a menu length containing at least 75% of possible orders.
Figures 4–6 shed light on this phenomenon. For high AUC nodes (Figures 4 and 5), the
network diagrams are tight clusters with connections that make intuitive sense. For example,
postpartum is directly connected to adjuncts like simethicone, toothache is connected to a
dental consult, and related tests like magnesium and phosphorus levels are linked. This
clustering and intuitiveness indicates that the correct amount of context was provided for
these nodes. The pregnancy network formed one giant cluster, which likely explains its high
overall performance. The low-performing nodes in the other networks were either part of
smaller subnetworks, or, in the case of the MICU, relied on infrequent diagnoses that were
not in the test set (Figure 6). Relationships among low-performing nodes are frequently
almost linear and have non-intuitive connections, indicating transitive associations due to
missing context. For example, restraints is directly connected to vancomycin (see Figure 6,
MICU); both might be appropriate when a patient has an infection causing delirium, but
they are not predictive of each other. Also a general medicine consult does not directly
predict a diagnosis of diabetes (see Figure 6, UVC), nor does a lumbar spine x-ray directly
suggest a knee x-ray. The context needed likely includes: additional well-chosen orders and
diagnoses, external information about patient health status, test results, and family history.
This points to the need for additional data sources and more principled feature selection.
Another interesting discovery is that AUC is not always strongly correlated with menu
position. Two examples can be seen in Table 4. A peripheral blood smear in the emergency
department has high AUC but an average menu position of 11.6, and an order for Lortab (a
narcotic painkiller) in inpatient pregnancy appears near the top of the suggestion menus but
has AUC of only 0.60. In the first case, we suspect that although the blood smear’s
probability increases just prior to it actually being ordered, it is never high enough to
outweigh other orders. In the second case, we believe the order stays at the top of the menu
until it is picked because it has a high prior probability. We therefore conclude that choosing
order-specific probability thresholds might be appropriate.
3.2. Comparison to ARM
Our results confirm previous results regarding ARM approaches: while an ARM approach
can readily detect the most common associations, the strength of less common associations
depends on context (e.g., previous orders and diagnoses) that ARM cannot capture.
There is a relatively small difference in weighted average menu length between
the two approaches (Table 5), especially in smaller domains like the ICU (difference +0.23
items). This indicates similar performance for the most common orders. However, the
unweighted difference is larger (+1.14 to +7.64 items), suggesting that the BN approach
has more impact on less common orders.
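The contrast between the weighted and unweighted differences can be sketched as follows. The function name is hypothetical, the sign convention (ARM minus BN) is an assumption, and the numbers in the usage note are invented rather than taken from Table 5.

```python
def average_difference(bn_position, arm_position, frequency):
    """Return (weighted, unweighted) mean of ARM minus BN menu position.

    All three arguments map an order name to a number; `frequency` is how
    often the order was placed and serves as the weight.
    """
    orders = list(bn_position)
    diffs = {o: arm_position[o] - bn_position[o] for o in orders}
    unweighted = sum(diffs.values()) / len(orders)
    total = sum(frequency[o] for o in orders)
    weighted = sum(diffs[o] * frequency[o] for o in orders) / total
    return weighted, unweighted
```

With one common order on which both methods agree and one rare order on which ARM is ten positions worse, frequencies of 90 and 10 give a weighted difference of 1.0 but an unweighted difference of 5.0, the same pattern of a small weighted and a larger unweighted gap.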
Comparing Figure 7 (ARM) to Figure 3 (BN) confirms this. Figure 3 displays a slow
increase in menu length as more orders are included, but Figure 7 shows a much steeper rise.
With the BN approach, a length of five accurately suggests an average of 16 orders (Figure
3). The same menu length with the ARM approach accurately suggests only 9 orders on
average (Figure 7). ARM performance degrades rapidly as menu length increases. This confirms
the BN approach’s overall superior performance.
3.3. Limitations and Future Directions
This research is predicated on the assumption that average patterns in the data represent
reasonably good care for future patients. As detailed in the Background, in many decision-making problems, average patterns do in fact represent ‘crowd wisdom,’ [41] but ‘crowd
madness’ (the domination of bad decisions in a group) can occur as well. Automatically
discriminating wisdom from madness is important future work. Presently the ‘wisdom’
discovered should be reviewed by experts and aligned with guidelines before deployment.
The other principal limitation is that our models currently rely only on a small set of orders
and diagnoses. We do not include other important factors such as test outcomes and
physiologic changes. Also, we evaluated the networks using time-stamped data but the
algorithm we used to learn networks does not utilize time information. Additionally, among
orders and diagnoses, we chose the most frequent. All of this biases our system toward short-term decisions that can be made with minimal context. We believe accuracy will be
improved significantly with context-aware feature selection and temporal extensions to BN
structure learning.
Our system and evaluation do not currently accommodate multiple orders of the same item
within a hospitalization. Upon examination of our training sets, only orders in the ICU
occurred multiple times on average per hospitalization. However, in the ICU, 16 orders (e.g.
ventilator protocol changes, IV fluids, and common tests) do occur with multiplicity, and for
this we need to develop a more complex methodology. We are exploring use of a ‘temporal
window’ around the actual occurrence of the order in which we consider it a true instance.
The BN approach requires that networks remain relatively small, or data requirements and
computational complexity become intractable. [42] We do not believe this makes them
unattractive for ‘big data’ problems, but it will require an approach to intelligently create sets
of largely independent domain-specific networks. We also plan to explore structure-learning
algorithms that scale to larger data sets.
Our comparison to ARM was a side-by-side comparison that might have unfairly benefitted
ARM. For one, only items chosen by GES were used in the menu - and some of the dropped
associations might have been incorrect transitive associations. Also, including less common
orders might show even more difference between BN and ARM. Further comparison is
important future work.
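For readers less familiar with pairwise ARM, the following minimal sketch shows how a confidence-ranked static list can be built. It assumes the standard definition confidence(A → B) = count(A and B) / count(A); the function names and the toy hospitalizations are hypothetical and are not the ARM implementation or data used in this comparison.

```python
from collections import Counter
from itertools import permutations

def pairwise_confidence(transactions):
    """Estimate confidence(A -> B) = count(A and B) / count(A) for every ordered pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for items in transactions:
        unique = set(items)
        item_counts.update(unique)                    # each item counted once per hospitalization
        pair_counts.update(permutations(unique, 2))   # ordered pairs (A, B) and (B, A)
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}

def static_menu(antecedent, confidence):
    """Rank every consequent of one antecedent by descending confidence (ties by name)."""
    rules = [(b, c) for (a, b), c in confidence.items() if a == antecedent]
    return sorted(rules, key=lambda rule: (-rule[1], rule[0]))

# Hypothetical hospitalizations (sets of orders); not data from the study.
transactions = [
    {"sitz bath", "cold pack", "ice chips"},
    {"sitz bath", "cold pack"},
    {"sitz bath", "ice chips"},
    {"cold pack"},
]
conf = pairwise_confidence(transactions)
print(static_menu("sitz bath", conf))
```

Because each rule conditions on a single antecedent, the list for a given order never changes as further evidence accumulates, and an association that holds only through an intermediate order still appears as a direct rule.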
Finally, our evaluation measures - AUC and menu position - only capture two aspects of the
approach’s predictive performance - discriminability and precision. There are many other
classification evaluation measures (see for example [43]). For this methodology, it would
also be valuable to measure the menu’s utility as a decision-making aid. This could be done
computationally using a decision-theoretic approach like decision curve analysis [44], or by
soliciting feedback regarding sample menus from potential users.
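The measures discussed above can be made concrete with a short sketch. It assumes the rank-based reading of AUC (the probability that a randomly chosen positive case is scored above a randomly chosen negative case [38]) and the net benefit used in decision curve analysis [44], TP/n − (FP/n) × p_t/(1 − p_t). The function names and all numbers are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def menu_position(menu, order):
    """1-based position of an order in a ranked menu (1 is the top suggestion)."""
    return menu.index(order) + 1

def net_benefit(true_pos, false_pos, n, threshold):
    """Decision-curve net benefit at a probability threshold strictly between 0 and 1."""
    return true_pos / n - (false_pos / n) * (threshold / (1.0 - threshold))

# Hypothetical posterior probabilities for cases where the order was / was not placed.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3]))
print(menu_position(["cold pack", "sitz bath", "ice chips"], "sitz bath"))  # -> 2
print(net_benefit(true_pos=30, false_pos=20, n=100, threshold=0.2))
```

AUC and menu position describe ranking quality only; net benefit additionally weighs false suggestions against true ones at a chosen threshold, which is why it is closer to a measure of utility.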
The proliferation of medical data in EMRs offers an opportunity to abstract these data for
use in clinical decision support. Both the challenges associated with creating localized
decision support and the incompleteness of guideline recommendations make this an
important task. Existing approaches using pairwise association rule mining produce long
static lists that accurately capture only common, direct associations.
In this work, we have developed and implemented a system using Bayesian network
learning to discover the typical successions of orders made by clinicians from local order-entry data, which we have used as an adaptive recommendation system to suggest the most
common next orders based on what has been ordered and diagnosed previously. We used a
hospitalization-simulation evaluation methodology to determine how well our system
reproduces reasonable behavior in four medical domains.
Our system performed fairly well on average in all domains but had variance that suggested
future improvements. It performed best in inpatient pregnancy (weighted average AUC .844,
weighted average menu position 3.91) and worst in the urgent visit clinic (weighted average
AUC .741, weighted average menu position 4.88). Our system had near-perfect performance
on some orders (e.g., cold pack in inpatient pregnancy) but very poor performance on others
(e.g., arterial blood gas monitoring in the medical intensive care unit). Higher performance
appears to correlate with the presence of more factors needed to predict the order.
Comparing our system to an ARM-based equivalent, we found that only the most common
orders are accurately suggested by both systems, and that a menu length of five suggested
only about half as many orders accurately in ARM vs. BN. This confirms that despite the
future work needed in our system, it does outperform existing approaches.
This study is a step forward in clinical knowledge-abstraction systems. Such a system could
eventually be part of the envisioned “learning health system,” in which a variety of clinical
users - including researchers, administrators, and physicians - could dynamically analyze vast
amounts of data for improved decision-making. This could be used, for example, for workload
reduction in developing localized CDS, or as a method to quickly analyze local practice
Thanks to: Jeff Warvel for providing both data and expertise regarding the county-hospital order-entry system; and
to Siu Hui for her insights into statistics and evaluation approaches. This work was performed at the Regenstrief
Institute, Indianapolis, IN and at the Massachusetts General Hospital Laboratory for Computer Science, Boston,
MA. This work was supported in part by grant 5T15 LM007117-14 from the National Library of Medicine.
1. Corrigan, JM.; Donaldson, MS.; Kohn, LT.; Maguire, SK.; Pike, KC. Washington DC: Institute of
Medicine; 2001. Crossing the quality chasm: a new health system for the 21st century.
2. Kaushal R, Shojania KG, Bates DW. Effects of computerized physician order entry and clinical
decision support systems on medication safety: a systematic review. Arch Intern Med. 2003;
163:1409–1416. [PubMed: 12824090]
3. Waitman LR. Pragmatics of Implementing Guidelines on the Front Lines. Journal of the American
Medical Informatics Association. 2004; 11:436–438. [PubMed: 15449402]
4. Geissbuhler A, Miller RA. Distributing knowledge maintenance for clinical decision-support
systems: the ‘knowledge library’ model. Proceedings of the AMIA symposium. 1999:770.
5. Garg AX, Adhikari NKJ, McDonald H, Rosas-Arellano MP, Devereaux PJ, Beyene J, Sam J,
Haynes RB. Effects of Computerized Clinical Decision Support Systems on Practitioner
Performance and Patient Outcomes: A Systematic Review. JAMA. 2005; 293:1223–1238.
[PubMed: 15755945]
6. Zhou L, Soran CS, Jenter CA, Volk LA, Orav EJ, Bates DW, Simon SR. The relationship between
electronic health record use and quality of care over time. J Am Med Inform Assoc. 2009; 16:457–
464. PMID:19390094. [PubMed: 19390094]
7. Van der Sijs H, Aarts J, Vulto A, Berg M. Overriding of Drug Safety Alerts in Computerized
Physician Order Entry. Journal of the American Medical Informatics Association. 2006; 13:138–
147. [PubMed: 16357358]
8. Standards & Interoperability (S&I) Framework. Health eDecisions Homepage. http://
9. Klann J, Wright A, McCoy A, Sittig D, Murphy S. Medical App Stores, Physician Cognitive
Overload, and Research Data Repositories: an Integration. Proceedings of Medicine 2.0 2012. 2012
10. Gorman PN, Ash J, Wykoff L. Can primary care physicians’ questions be answered using the
medical journal literature? Bull Med Libr Assoc. 1994; 82:140–146. PMID:7772099. [PubMed:
11. Haug JD. Physicians’ preferences for information sources: a meta-analytic study. Bull Med Libr
Assoc. 1997; 85:223–232. PMID:9285121. [PubMed: 9285121]
12. Perley CM. Physician use of the curbside consultation to address information needs: report on a
collective case study. J Med Libr Assoc. 2006; 94:137–144. PMCID: PMC1435836. [PubMed:
13. Ford EW, Menachemi N, Phillips MT. Predicting the adoption of electronic health records by
physicians: when will health care be paperless? J Am Med Inform Assoc. 2006; 13:106–112.
[PubMed: 16221936]
14. Blumenthal D, Tavenner M. The ‘Meaningful Use’ Regulation for Electronic Health Records. N
Engl J Med. 2010; 363:501–504. PMID 20647183. [PubMed: 20647183]
15. Office of the National Coordinator for Health IT. Federal Health Information Technology Strategic
Plan 2011–2015. 2011.
16. McGlynn EA, Asch SM, Adams J, Keesey J, Hicks J, DeCristofaro A, Kerr EA. The Quality of
Health Care Delivered to Adults in the United States. N Engl J Med. 2003; 348:2635–2645.
[PubMed: 12826639]
17. Condorcet M. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la
pluralité des voix. l’Imprimerie Royale. 1785
18. Arrow KJ. A Difficulty in the Concept of Social Welfare. The Journal of Political Economy. 1950;
19. Austen-Smith D, Banks JS. Information Aggregation, Rationality, and the Condorcet Jury
Theorem. The American Political Science Review. 1996; 90:34–45.
20. Fisher E, Goodman D, Skinner J, Bronner K. Health Care Spending, Quality, and Outcomes.
21. Hasan S, Duncan GT, Neill DB, Padman R. Towards a collaborative filtering approach to
medication reconciliation. AMIA Annu Symp Proc. 2008:288–292. PMID:18998834. [PubMed:
22. Wright A, Chen E, Maloney FL. Using Medication Data and Association Rule Mining for
Automated Patient Problem List Enhancement. AMIA Annu Symp Proc. 2009:707.
23. Klann J, Schadow G, McCoy JM. A Recommendation Algorithm for Automating Corollary Order
Generation. Proceedings of the AMIA Symposium. 2009:333–337.
24. Carter JS, Brown SH, Erlbaum MS, Gregg W, Elkin PL, Speroff T, Tuttle MS. Initializing the VA
medication reference terminology using UMLS metathesaurus co-occurrences. Proceedings of the
AMIA Annual Symposium. 2002:116–120. PMID:12463798.
25. McCoy AB, Wright A, Laxmisan A, Ottosen MJ, McCoy JA, Butten D, Sittig DF. Development
and evaluation of a crowdsourcing methodology for knowledge base construction: identifying
relationships between clinical problems and medications. Journal of the American Medical
Informatics Association. 2012; 19:713–718. PMID:22582202. [PubMed: 22582202]
26. Frankovich J, Longhurst CA, Sutherland SM. Evidence-Based Medicine in the EMR Era. New
England Journal of Medicine. 2011; 365:1758–1759. PMID:22047518. [PubMed: 22047518]
27. Linden G, Smith B, York J. Amazon.com Recommendations: Item-to-item Collaborative Filtering.
28. Wright A, Chen ES, Maloney FL. An automated technique for identifying associations between
medications, laboratory results and problems. Journal of Biomedical Informatics. 2010; 43:891–
901. PMID:20884377. [PubMed: 20884377]
29. Wright A, Pang J, Feblowitz JC, Maloney FL, Wilcox AR, Ramelson HZ, Schneider LI, Bates
DW. A method and knowledge base for automated inference of patient problems from structured
data in an electronic medical record. J Am Med Inform Assoc. 2011; 18:859–867. [PubMed:
30. Heckerman DE, Nathwani BN. Toward normative expert systems: Part II. Probability-based
representations for efficient knowledge acquisition and inference. Methods of Information in
medicine. 1992; 31:106–116. [PubMed: 1635462]
31. Heckerman D. A Tutorial on Learning with Bayesian Networks. Innovations in Bayesian
32. Klann J, Schadow G, Downs S. A Method to Compute Treatment Suggestions from Local Order
Entry Data. Proceedings of the AMIA Symposium. 2010:387–391. PMID:21347006.
33. Druzdzel, MJ. SMILE: Structural Modeling, Inference, and Learning Engine and GeNIe: a
development environment for graphical decision-theoretic models; Proceedings of the 16th
national conference on Artificial intelligence and the 11th Innovative applications of artificial
intelligence conference; 1999. p. 902-903.
(ACM ID: 315504.)
34. Buntine, W. Theory refinement on Bayesian networks; Proceedings of the seventh conference
(1991) on Uncertainty in artificial intelligence; 1991. p. 52-60.
id=114098.114105 (ACM ID:114105.)
35. Eaton D, Murphy K. Exact Bayesian structure learning from uncertain interventions. AI &
Statistics. 2007:107–114.
36. Chickering DM. Optimal structure identification with greedy search. J Mach Learn Res. 2003;
37. Ramsey J. Tetrad Project Homepage. 2011
38. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic
(ROC) curve. Radiology. 1982; 143:29–36. [PubMed: 7063747]
39. Bastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and
Manipulating Networks. 2009
40. Tsamardinos I, Aliferis CF. Towards principled feature selection: Relevancy, filters and wrappers.
Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. 2003
41. Surowiecki, J. The wisdom of crowds. Random House, Inc.; 2005.
42. Chickering DM, Heckerman D, Meek C. Large-sample learning of Bayesian networks is NP-hard.
The Journal of Machine Learning Research. 2004; 5:1287–1330.
43. Medlock S, Ravelli ACJ, Tamminga P, Mol BWM, Abu-Hanna A. Prediction of Mortality in Very
Premature Infants: A Systematic Review of Prediction Models. PLoS ONE. 2011; 6:e23441.
[PubMed: 21931598]
44. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models.
Med Decis Making. 2006; 26:565–574. PMID: 17099194. PMCID: PMC2577036. [PubMed:
Local wisdom can complement expert-curated decision support to reduce
We reconstruct physician decisions by aggregating events in medical record
Our approach finds complex data relationships and creates dynamic order
We simulate the menus in a test set; many scenarios show strong performance
Our approach increasingly outperforms a simpler approach as orders become
Figure 1.
An example Bayesian Network (left), the conditional probability tables associated with it (middle), and the posterior
probabilities given the evidence of ‘Abdominal Pain’ (right).
Figure 2.
A prototype implementation of Iterative Treatment Suggestions (ITS). The panel shows the current evidence (labeled 0 or 1) and
the possible orders in descending probability order. As orders and diagnoses are placed (the toggle button), the evidence is
revised and the posterior probability of possible orders given the network is recalculated.
Figure 3.
The average position in the list at the time of order vs. the frequency rank of the order in the test sets.
Figure 4.
A portion of the inpatient pregnancy networks. This figure shows the Markov blankets of C-Section Operative Note, Ext. UC
Monitor, and Sitz Bath, three nodes with high AUC in Table 4. These three Markov Blankets comprise the majority of the total
graph, and the graph forms one single connected component - indicating strong relationships between all nodes in this network.
Orders are purple; problem/complaints are yellow. Node/label size is proportional to AUC, and edge weight is an approximation
of the strength of relationship. Notice the highly-correlated clusters, e.g. Sitz bath and other postpartum treatments (cold pack,
ice chips, lanolin, etc).
Figure 5.
High AUC nodes from Table 4 with their parents and children in all domains but inpatient. MICU is blue, UVC is green, and
ED is red. Problems/complaints are yellow. Node/label size is proportional to AUC, and edge weight is an approximation of the
strength of the relationship. Here, notice the logical clusters and intuitively correct relationships.
Figure 6.
Low AUC nodes from Table 4 with their parents and children in all domains but inpatient. Notice the linear chains, multiple
subnetworks, connection to infrequent diagnoses, and transitive relationships. This indicates appropriate context is lacking for
these nodes.
Figure 7.
Using an association rule mining approach, the average position in the list at the time of order vs. the frequency rank of the order
in the test sets.
Table 1
A formal description of the ITS methodology for suggesting orders via a Bayesian network. This parallels the
graphical example in Figure 2.
Algorithm: Iterative Treatment Suggestion (ITS)
G is a Bayesian Network Model
O is a set of possible orders, initially including all orders in G
D is a set of possible diagnoses, including all diagnoses in G
E is a set of evidence, initially containing all D set to false
Repeat
1. Update beliefs (compute the posterior probability of all O ∉ E).
2. Create a list of all O ∉ E in descending order of posterior probability, optionally stopping at a predefined threshold.
3. Display the list and D to the user and wait for the user to choose an order or diagnosis from the list.
4. Move the order from O to E, or set the diagnosis to true in E.
Until the user closes the session
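The ITS loop in Table 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated here: the three-node network, its probabilities, and the names `posterior`, `its_menu`, and `its_session` are all hypothetical, and brute-force enumeration stands in for a real BN inference engine.

```python
from itertools import product

# Hypothetical binary network: one diagnosis and two orders.
# PARENTS[node] lists parent names; CPT[node] maps a tuple of parent values
# to P(node = True | parents). All numbers are invented for illustration.
PARENTS = {
    "abdominal_pain": (),
    "abdomen_ct": ("abdominal_pain",),
    "iv_fluids": ("abdomen_ct",),
}
CPT = {
    "abdominal_pain": {(): 0.2},
    "abdomen_ct": {(True,): 0.7, (False,): 0.05},
    "iv_fluids": {(True,): 0.6, (False,): 0.3},
}
NODES = list(PARENTS)

def joint(assignment):
    """Probability of one full True/False assignment under the network."""
    p = 1.0
    for node in NODES:
        p_true = CPT[node][tuple(assignment[q] for q in PARENTS[node])]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

def posterior(target, evidence):
    """P(target = True | evidence), summing the joint over all unobserved nodes."""
    hidden = [n for n in NODES if n not in evidence]
    num = den = 0.0
    for values in product([True, False], repeat=len(hidden)):
        full = dict(evidence, **dict(zip(hidden, values)))
        p = joint(full)
        den += p
        if full[target]:
            num += p
    return num / den

def its_menu(orders, evidence, threshold=0.0):
    """Steps 1-2: rank orders not yet in the evidence by posterior probability."""
    ranked = [(o, posterior(o, evidence)) for o in orders if o not in evidence]
    ranked = [(o, p) for o, p in ranked if p >= threshold]
    return sorted(ranked, key=lambda item: -item[1])

def its_session(orders, diagnoses, choices):
    """Replay scripted user choices; return the menu shown before and after each one."""
    evidence = {d: False for d in diagnoses}    # E starts with every diagnosis false
    menus = []
    for choice in choices:                       # stands in for "until the user closes"
        menus.append(its_menu(orders, evidence))  # step 3: menu displayed to the user
        evidence[choice] = True                  # step 4: chosen order or diagnosis enters E
    menus.append(its_menu(orders, evidence))
    return menus

menus = its_session(
    orders=["abdomen_ct", "iv_fluids"],
    diagnoses=["abdominal_pain"],
    choices=["abdominal_pain", "abdomen_ct"],
)
for menu in menus:
    print(menu)
```

In this toy session the top suggestion changes once the diagnosis is set to true, which is the adaptive behavior the menu is meant to show. Enumeration is exponential in the number of unobserved nodes, so it is only suitable for very small illustrative networks.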
Table 2
The co-occurring diagnoses and complaints in each domain-specific network, listed by their prevalence in the test sets. 0% indicates the co-occurrence
was only present in the training set. These were used as evidence as they appeared in the test cases, and were not part of the predictive evaluation.
C-Section Repeat
Failed induction
Failed Induction
Preterm Labor
Abdominal Pain
Shoulder Pain
Med Refill
Knee Pain
Chest pain
Abdominal Pain
Neck Pain
Spont Vag Delivery
Cesarean Section
Back pain, ED
Vehicle Accident
Tubal Ligation
Pregnancy, Inpatient
Coronary Artery Disease
Back Pain
Diabetes Mellitus
Med Refill
Hypertension, UVC
Diabetes Mellitus
Drug Abuse
Medical ICU
Table 3
For each domain, the weighted average AUC (area under the receiver operating characteristic curve) and position in menu at
time of order (where 1 is the top suggestion). Weighting is by frequency of order.
Weighted Average:
Inpatient Pregnancy
Medical Intensive Care Unit
Back pain in the Emergency Department
Hypertension in the Urgent Visit Clinic
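The frequency weighting used for these averages can be illustrated with a short sketch. The function name and the per-order AUCs and frequencies below are hypothetical and are not values from the study.

```python
def weighted_average(values, weights):
    """Average of values in which each value counts in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Hypothetical per-order AUCs and order frequencies; not values from the study.
aucs = [0.90, 0.80, 0.60]
frequencies = [50, 30, 20]
print(weighted_average(aucs, frequencies))   # weighted by order frequency
print(sum(aucs) / len(aucs))                 # unweighted, for comparison
```

Weighting by frequency means that common orders dominate the domain-level figure, so a domain can score well overall even when its rare orders are predicted poorly.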
Table 4
Order name, AUC, and average menu position (#) of the best and worst order predictions in each domain. ‘Best’ and ‘worst’ are chosen by AUC (higher is
better). Menu position, showing the average location in the suggestion menu just before selection, is also reported (lower is better).
Pregnancy, Inpatient: IV Lock; Syphilis Screen; Ice Chips; IV Fluids; Type and Screen; Lortab 5/500; I&O Monitoring; Drugs Urine Test; Oxytocin Protocol; Ext. FHT Monitor; Docusate Na; Ext. UC Monitor; Lung Exercise; Naloxone Inj; Morphine (PCA); Cold Pack; Sitz Bath
Back Pain, ED: Lumbar Spine MRI; Med Follow-up Consult; Medicine Consult; Neurosurgery Consult; EPIC Referral; Sports Med. Consult; Wrist Xray; Knee Xray; Lumbar Spine CT; Phys. Therapy Consult; Comp. Metabolic; Spine Cervical CT; Chest CT; Vaginal Infection Test; Blood Cell Profile; Cardiac Markers; Peripheral Smear; Pelvis CT; Abdomen CT
Hypertension, UVC: Knee Xray; T4-Free Level; Head CT; Phys. Therapy Consult; Lateral Chest Xray; Dermatology Consult; Med Follow-up Consult; Medicine Consult; Hgb A1c; Dental Consult; Urine Culture; Blood Cell Profile; BNP Test; Drug Abuse Urine Test; Blood Culture; ESR Test; Cardiac Markers
Altered Mental State, MICU: Arterial blood gas; Frontal Chest Xray; IV Fluids; Cardiac Markers; Basic Metabolic Panel; Magnesium Level; Phosphorus Test; Ventilator Adjustment; Vancomycin Level
Table 5
For each domain, the weighted average position in menu at time of order, where 1 is the top suggestion, for
the BN and ARM approaches. Weighting is by frequency of order. Also shows the weighted and unweighted
difference in average list length (ARM-BN).
Weighted Average Position:
Inpatient Pregnancy
Medical Intensive Care Unit
Back pain in the Emergency Department
Hypertension in the Urgent Visit Clinic