The CLEF Chronicle: Transforming Patient Records into an E-Science Resource Alan Rector

The CLEF Chronicle: Transforming
Patient Records into an E-Science Resource
Jeremy Rogers, Colin Puleston, Alan Rector
Biohealth Informatics Group, University of Manchester, UK
Electronic patient records are typically optimised for delivering care to a single patient. They omit significant
information that the care team can infer whilst including much of only transient value. They are secondarily an
indelible legal record, comprised of heterogeneous documents reflecting local institutional processes as much as, or
more than, the course of the patient’s illness.
By contrast, the CLEF Chronicle is a unified, formal and parsimonious representation of how a patient’s illness
and treatments unfold through time. Its primary goal is efficient querying of aggregated patient data for clinical
research, but it also supports summarisation of individual patients and resolution of co-references amongst clinical
documents. It is implemented as a semantic network compliant with a generic temporal object model whose
specifics are derived from external sources of clinical knowledge organised around ontologies.
We describe the reconstruction of patient chronicles from clinical records and the subsequent definition and
execution of sophisticated clinical queries across populations of chronicles. We also outline how clinical
simulations are used to test and refine the chronicle representation. Finally, we discuss a range of engineering and
theoretical challenges raised by our work.
1. Introduction
The Clinical e-Science Framework (CLEF) project
is a United Kingdom eScience project, sponsored by
the Medical Research Council. It aims to establish a
policy and technical infrastructure through which data
arising from routine medical care across multiple sites
and institutions may be collected and presented as an
aggregated research repository in support of
biomedical research [1].
Existing patient records are very rich. A typical
example in our research corpus contains a thousand or
more numeric data points, a chronology of five or six
hundred significant clinical events (clinics attended,
drugs dispensed, etc.), plus a couple of hundred
narrative documents (letters between doctors,
pathology, radiology or body scan results etc.).
However, even when in electronic form, most of this
data is intended for exclusively human interpretation.
Furthermore, much of the critical information is left
implicit. To take a common example from the cancer
domain, it is rarely stated explicitly why the drug
Tamoxifen is being given. Clinicians know that there is
little other reason to prescribe it other than to prevent
recurrence of breast cancer and, as a general rule, do
not bother to record what can be assumed. Similarly,
when a drug must be stopped because of its side effects
(e.g. anaemia due to chemotherapy), the causal
connection between side effect and stopping the drug is
rarely mentioned, and even the nature of the side effect
itself may have to be inferred from e.g. serial
laboratory tests showing low haemoglobin values
rather than being explicitly stated as ‘anaemia’.
Other partners in the CLEF programme are
researching technologies to retrieve useful information
from clinical narratives [2]. This paper is a progress
report on an orthogonal problem: assuming the full
richness of clinical information were available –
whether extracted from traditional clinical records post
hoc or acquired a priori using entirely different clinical
data capture paradigms – how might that information
be represented for the maximal benefit of clinical
research? Clinicians are particularly interested in
certain types of information rarely formalised in
current records, e.g.
• the precise nature of any imaging investigation
(e.g. which anatomical structure was imaged)
• the result of any investigation (what was found,
what it signifies)
• why any investigation was ordered
• earlier provisional diagnoses
• what symptoms the patient experienced from the
disease, or its treatment
• when treatment was changed and why
• abstractions such as ‘relapse’ or ‘stage’ of disease
Our goal is to define a representation supporting
increasingly detailed information about patients and
allowing us to evaluate its usefulness. We call this
novel representation the CLEF Chronicle.
2. Properties of CLEF Chronicle
The CLEF Chronicle for an individual patient seeks
to represent their clinical story entirely as a network of
typed instances and their interrelations. Figure 1
illustrates the general flavour of what we are trying to
represent: a patient detects a painful mass in their
breast, as a result of which a clinic appointment occurs,
where drug treatment for the pain is arranged and also
a biopsy of the (same) mass in the (same) breast. The
first clinic arranges a follow-up appointment, to review
the (same) biopsy, which finds cancer in the (same)
mass. This (same) finding is in turn the indication for
radiotherapy to the (same) breast.
P ain
in dication
abo ut
lo cus
B reast
D rug
C linic
lo cus
p lans
C linic
plan s
lo cus
abo ut
abo ut
lo cus
B iopsy
fin din g
in dication
M ass
lo cus
plan s
R adio
in dication
C ancer
F ig u re 1 : In fo rm a l V iew o f P a tien t-H isto ry F ra g m en t
(N O T E : T im e-flo w is ro u g h ly left Æ rig h t)
In addition to the obvious structural difference
between this representation and that of traditional
electronic records, any clinical content represented as a
CLEF Chronicle should have two central properties:
Parsimony – traditional patient records contain
multiple discrete mentions of relevant instances in the
real world (the tumour, the breast etc). A CLEF
Chronicle should have only one occurrence of each.
Explicitness – traditional patient records may only
imply clinically important information (e.g. the fact of
relapse). This must be explicit within a CLEF
Chronicle Representation.
3. Functions of CLEF Chronicle
Firstly, and most importantly, the CLEF Chronicle
is intended to support more detailed, and more
expressive, querying of aggregations of patient stories
than is currently possible whilst at the same time
improving the efficiency of complex queries. More
detailed, because the Chronicle is more explicit: we
can now ask e.g. how many patients ‘relapsed’ within a
set period of treatment. More expressive, because the
typing information associated with each Chronicle
instance is drawn from a rich clinical ontology, such
that queries may be framed in terms of arbitrarily
abstract concepts: we can ask how many cancers of the
lower limb were recorded, and expect to retrieve those
of all parts of the lower limb. This is more efficient,
because the traditional organization of patient records
tends to require much serial or nested processing of
Secondly, an individual Chronicle can serve as an
important knowledge resource during its own
reconstruction from available electronic sources of
traditional clinical records. In particular, a Chronicle
can help resolve the frequent co-references and
repeated references to real-world instances such as
characterize traditional records. For example, heuristic
and other knowledge linked to an ontology of
Chronicle data types can be used to reject any request
to instantiate more than one identifier for a [Brain], or
any attempt to merge two mentions of [Pregnancy]
separated by more than 10 months.
Thirdly, the Chronicle is intended to serve as a
knowledge resource from which summarising
information or abstractions may be inferred. For
example, deducing ‘anaemia’ from a run of discrete
low blood counts obtained from the traditional record,
or ‘remission’ from several years of clinical inactivity
and ‘relapse’ when this is followed by a flurry of tests
and a new course of chemotherapy. Similarly, where
the record does not explicitly say why drug X was
given, reasoners browsing a Chronicle may identify
condition Y because the drug has no other plausible
context of use.
Fourthly, the CLEF Chronicle is intended to support
automatic summarisation of patient records. Given the
sometimes chaotic nature of real patient records,
manual case summarisation is recognised as good
clinical practice. Manual derivation of such summaries
from the content of the record is, however, notoriously
time consuming whilst the result of such labours is
notoriously out of date whenever it would be most
clinically valuable.
The fundamental problem is that existing patient
records are really “logbooks” (with similar legal
significance) of what healthcare staff have heard, seen,
thought and done [3] They often contain contradictory
information; each entry reflects the understanding of
the problem at the time it was made so that tracing the
evolution of problems is non-trivial. The information is
recorded in the order in which it was discovered rather
than in the order in which it occurred – hence later
entries may often include information about earlier
events, and a precise diagnosis may not be available
until long after the first entries that pertain to it were
made. Much of the information is either un-interpreted
(for example the serial low haemoglobins already
described) or under-interpreted: by convention,
radiologists only report what they are confident they
have seen on an image. Thus, a typical xray report may
state only ‘There is evidence of moderate osteoporosis
in the bony spine’ thus leaving it to the requesting
physician to infer the more important additional
interpretation ‘(but) there are no osteolytic lesions that
might suggest to cancer has spread to the bone’.
Maintaining summaries of clinical records,
therefore, requires them to be transformed from a log
whose chronology reflects the time of discovery of the
underlying data to one reflecting the order of
occurrence of events. Further, the chronology must be
recorded at an appropriate level of temporal abstraction
such that we may infer our best guest at what happened
and how and why the patient was managed.
A CLEF Chronicle is intended to be a valuable
substrate from which custom summaries of the story
may be generated as human readable text. Different
summarization strategies (and display vocabularies)
can be used for different classes of user, such as for the
doctor or the patient.
4. Chronicle Representation
The Chronicle Representation comprises a
collection of core concepts represented as a Java-based
Chronicle Object Model (COM), and more detailed
knowledge represented via a set of declarative External
Knowledge Sources (EKS). This dual scheme is
determined by the differing characteristics of the
represented concepts. The architecture also allows for
the incorporation of EKS-related inference mechanisms
that can assist in the dynamic expansion of the COM
(see below).
The concepts represented in the COM require
associated procedural code of a concept-specific
nature. The object-oriented format provides a natural
means of achieving this. Though the representation of
such concepts might alternatively be divided into
declarative and procedural sections, this would result
in an inelegant duplication of the model structure.
Moreover, we believe the COM will be relatively
stable and generic, such that ease-of-update and
flexibility are not primary considerations.
Conversely, the EKS contain detailed knowledge
requiring more regular maintenance, but are less likely
to require concept-specific processing. Hence, a
declarative format is more appropriate.
The precise division between COM and EKS is
pragmatic rather than principled. Details are likely to
evolve as the Chronicle Representation develops, with
concepts crossing the line in both directions.
We envisage the EKS as a collection of knowledge
sources, ultimately provided by multiple different
ontologies, databases and sets of medical archetypes
[4]. Similarly, the EKS-related inference mechanisms
could be of varying types, including logic-based
reasoning systems, rule-base systems, and dedicated
procedural mechanisms. The COM is indifferent to
how components of the EKS are represented, and how
associated inferences are achieved, with both
representations and inference mechanisms being
accessed via a suitable Java API. Currently, the EKS
for the patient chronicle is provided by a single OWL
ontology developed specifically to meet our immediate
requirements, and the associated inference by a
Description Logic [6] reasoning system (FaCT++ [9]).
This example helps illustrate the two main types of
procedures embodied within the COM. These are:
P ath olo g y
L ocu s E K S
typ e
lo cu s
L o cu s
P ath olo g y
H istory
(S P A N )
P ath o lo g y-T yp e E K S
h isto ry d escrip tors[]
V a riab le
H isto ry
(S P A N )
sn ap sh ots[]
P ath olo g y
S n a p sh o t
(S N A P )
sn ap sh o td escrip tors[]
V a riab le
V a lu e
so u rces[]
C lin ica l P ro ced u re
(S N A P o r S P A N )
go a ls[]
C lin ical-G o al E K S
C lin ica l G o a l
F ig u re 2 : C h ro n icle O b ject M o d el (C O M ) F ra g m en t
(S im p lified V ersio n )
The COM has been developed as a standard Javastyle API, primarily for use by the ‘chroniclisation’
mechanisms that we will develop, and the ‘chronicle
simulator’ (described below). However, a more generic
network-style representation, together with an
associated translation mechanism based on the Java
introspection facility, is available. This alternative
representation provides a means of representing COMbased queries, and is suitable for driving chronicledisplay and query-formulation GUIs.
The COM comprises both a generic section and a
specific clinical section, with the latter being built upon
the former. The generic section includes, most notably,
the following:
• EKS access facilities;
• A temporal model that implements Smith’s
SNAP/SPAN distinction [5].
Figure 2 depicts a fragment of the clinical section. It
can be seen that the representation of a particular
pathology (such as an individual tumour or headache)
comprises two main elements:
• A single PathologyHistory object, representing the
pathology as a SPAN entity through time.
• An associated set of PathologySnapshot objects,
each representing a SNAP view of the pathology at a
specific point in time.
The pathology representation also includes,
amongst other things, references to EKS concepts
representing type (e.g. ‘Tumour’) and location (e.g.
• Dynamic COM expansion based on EKS
knowledge, and EKS-related inference: For
example, the EKS that represents pathology-types
will provide sets of ‘descriptor’ attributes for each
concept (e.g. ‘size’ and ‘shape’ for the ‘Tumour’
concept). These attributes are used to dynamically
create sets of both ‘snapshot-descriptor’ and
‘history-descriptor’ fields on the PathologySnapshot
Furthermore, as the fields in the COM (both static
and dynamically-created) are assigned values, the
EKS-related inference mechanisms will be
consulted, which may result in the provision of
additional ‘descriptor’ attributes. Following on from
the above example: if the ‘location’ field on the
PathologyHistory object is assigned a value of
‘Breast’, then the inference mechanism will infer
that an additional ‘descriptor’ attribute called
‘herceptin-2-receptor’ is required. This is a Boolean
valued attribute that is only applicable to tumours
located in the breast.
• Dynamic data abstraction: For example, the value
of a ‘history-descriptor’ field is a dynamically
updated temporal summary of the values of the
corresponding set of ‘snapshot-descriptor’ fields
(e.g. if the set of snapshots for ‘Tumour’ records
‘size’ as ‘2’, ‘4’ and ‘7’ mm, the temporal
summaries will include ‘max-value = 7’ and
‘strictly-increasing = true’).
The COM also involves the expression of various
types of data/query creation constraint.
5. The CLEF Simulator
Testing or validation of the CLEF Chronicle is
problematic because no real patient data contains the
details required in machine readable form. Indeed, a
major purpose of developing the chronicle is to guide
efforts to improve patient records and make them more
appropriate for research. Efforts to transcribe manually
even small amounts of our experimental corpus of
traditional electronic records proved so time
consuming as to be impractical.
Our solution to this impasse has been to construct a
simulator to model breast cancer patients. The
processes modelled include the way in which tumour
cell colonies grow, metastasise and cause local and
systemic effects, the behaviour of the patient both as a
biological system and as a sentient healthcare
consumer in response to local or systemic
symptomatology arising from either the tumour or its
treatment, and the actions of the clinician as a provider
of investigations of varying sensitivity and treatments
of variable effectiveness. The modelling process
involves dynamic event-driven interactions through
simulated time of all three actors (disease, patient and
clinician). The simulator generates simulated clinical
stories of similar content and complexity to real life,
represented both as chronicles and as note-form text.
Whilst the simulator models were engineered to
superficially approximate real populations of patients,
they do not pretend to possess any predictive properties
with respect to, for example, what effect any change in
treatment efficacy, or clinical service reconfiguration,
might have on real population outcomes. Rather, for
our stated purpose of testing the Chronicle as a means
to represent individual patient stories, it is sufficient to
evaluate whether the complexity and content of
individual generated clinical stories approximates that
of real clinical stories: whether they have surface truthlikeness, or verisimilitude.
Judging the truth-likeness of generated stories is
necessarily subjective and imprecise. Formal
evaluation would be inherently problematic not least
because of the difficulty in obtaining sufficient time
from specialist clinicians to serve as the judges. At the
present time, therefore, we base our claim that the
simulated stories approximate real clinical stories on
the following evidence:
• The prototype simulator was written by a clinician
(JR) and is based on a detailed model of tumour,
patient and clinician behaviour and their mutual
dynamic interactions.
Figure 3 : graphical plot of simulator
parameters over course of one simulation
• Inspection by JR of several hundred simulations,
represented both as short note output (Figure 5) and
graphical time-line views of key simulator
parameters (Figure 3), did not identify any
significantly implausible patients.
• Although not designed or required to produce
simulated populations of patients with characteristics
mimicking those of real populations,
Kaplan-Meier curves (a standard tool for comparing
cancer survival) for populations of simulated patients
are similar to those for populations of real patients
(Figure 4), including those curves representing
subanalyses for tumour grade or disease stage at
• Two senior oncology physicians invited to comment
on the generated narrative output of simulators
Figure 4 : Overall survival curves for real and
simulated breast cancer patients
suggested only minor changes (e.g. both AST and
GGT should appear as liver damage markers).
• Queries similar to those forming the central research
question in typical registered cancer trials may be
constructed and executed against the simulated
chronicle repository, as well as more detailed
questions such as ‘what effect on long term survival
accrues from interrupting a course of chemotherapy
in order to go on holiday’.
Whilst this provisional evaluation of the simulator
suggests that its generated clinical stories are adequate
simulacra of reality, its documentation of them in
CLEF Chronicle format clearly significantly exceeds
that of usual clinical practice both in detail and
completeness: it includes many information holders
that are not explicitly present in the traditional medical
record, in particular the relationships between entities
in the story. For example, every test has an explicit
causal relation with the phenomenon (usually a
problem) that was the reason for requesting the test.
Additionally, the simulator records clinically
significant entities that are often not only entirely
absent from the traditional clinical record, but also
would be difficult to automatically infer. These include
significant negative investigation findings, the stage of
the disease at a given point in time, or exactly when
(and on what evidence) a patient changed status from
being in remission to being in relapse.
Patient notices a problem aged 46 (wk 209)
** Patient presents aged 45.8 (wk 212) **
Ix:Xray chest: normal
Ix:Clinical examination showed: a new 4.06cm breast mass
Follow-up in 3 weeks with results to plan treatment
** Initial Treatment Consult: week 215 **
** Surgery **
4.26 cm lesion was incompletely excised
Ix:Histopathology report on excision biopsy
Grade 8 invasive ductal adenocarcinoma of breast.
Oestrogen receptor negative
21 nodes were positive
Six weeks of radiotherapy to axilla should follow surgery
Secondaries: commence 6 weeks of radiotherapy.
Secondaries: commence 6 weeks of systemic chemotherapy.
High grade tumour: for chemotherapy
Tumour is hormone insensitive. No hormone antagonist
Follow-up in 8 weeks
** Consult to assess response to treatment : week 223 **
Ix:Xray chest: normal
Ix:Clinical examination normal
No evidence of further tumour.
See in 8 weeks
Radiotherapy cycle given. 5 remaining
Radiotherapy cycle 5 deferred because patient on holiday
Radiotherapy cycle given. 4 remaining
<<< SNIP >>>
** Consult to assess response to treatment : week 428 **
Ix:Xray chest: abnormal.
No new lesions found
2 old lesions found
Largest measures 1cm, smallest is 0.7cm
High alkaline phosphatatse: for bone scan
Ix:Bone Scan: abnormal.
No new lesions found
17 old lesions found
Largest measures 1.2cm, smallest is 0.7cm
Patient confused: for CTScan Brain
Ix:CTScan of brain: abnormal.
No new lesions found
6 old lesions found
Largest measures 1.4cm, smallest is 0.8cm
High grade unresponsive tumour. Palliative Care
See in 8 weeks
** Died aged 50 from cancer (week 475)
Grade 8 - 23 local mets, 70 distant metastasis
Seen in clinic 19 times and 4 course of chemo prescribed
Total tumour mass 1374700 in 200 tumours of which 48
ESR=160 Creatinine=559
41 tumours were destroyed or removed
54 bony mets, total mass 740941 d=5.41cm AlkPhos=4811
Figure 5: Extract of simulator ‘short note’ output
Using the simulator output, we may cautiously
estimate the richness and complexity that ‘real’ clinical
stories – and aggregations of them - might take on if
represented as Chronicles. An analysis of a simulated
population of 986 patient chronicles revealed:
• Individual patient chronicles comprise an average of
385 object instances and 753 semantic relations
• The most complicated patient story required 2295
instances and 4636 relationships
• The aggregated repository, comprising 986 discrete
networks, contained 382,103 instances and 742,595
semantic relations.
The prototype simulator and documentation is
available for download from:
Subsequent releases of the reimplementation may be
made available through the same URL in future.
We have now developed a more generalised Java
re-implementation of the simulator that is capable of
creating richer patient records, in a format suitable for
populating the Chronicle Object Model.
6. Future Challenges
Outstanding issues concerning the Chronicle
Representation include:
• How to store very large numbers of Chronicle
Object Model instances persistently such that queries
may be constructed, and efficiently executed, over
aggregations of such instances.
• How to implement such a query mechanism
involving both ontological and temporal reasoning.
• How to deal with data arising from the
‘chroniclisation’ of real records will almost certainly
be both incomplete and ‘fuzzy’.
Other issues concern the presentation of the chronicle
to the clinician, including:
• How the clinical content of a reconstructed chronicle
can be visualised in order to validate its content
against the more traditional patient record data from
which it has been derived.
• How complex queries over sets of patient chronicles
can best be formulated by the ordinary clinician.
7. Discussion
The idea of representing clinical information as
some form of semantic net, particularly focussing on
why things were done, is not new: echoes of it can be
found in Weed’s work on the problem oriented record
[7]. Ceusters and Smith more recently advocated the
resolution of coreferences in clinical records to
instance unique identifiers (IUIs) [8].
The semantic web initiative offers new possibilities
for implementing such an approach, but the lack of any
suitable clinical data severely constrains any practical
experimentation. The CLEF Simulator provides a
useful means to explore some of the computational and
representational issues that arise.
8. References
1. Taweel A, Rector, AL, Rogers J, Ingram D, Kalra D,
Gaizauskas R, Hepple M, Milan J, Power R, Scott D,
Singleton P. (2004) CLEF – Joining up Healthcare
with Clinical and Post-Genomic Research. Current
Perspectives in Healthcare Computing:203-211
2. Harkema H, Roberts I, Gaisauskas R, Hepple M.
(2005) A web service for biomedical term look-up.
Comparative and Functional Genomics 6;1-2:86-83
3. Rector A, Nowlan W, Kay S. (1991) Foundations for
an Electronic Medical Record. Methods of
Information in Medicine;30:179-86.
4. Beale T. (2003) Archetypes and the EHR. Stud
Health Technol Inform. 2003;96:238-44
5. Grenon P, Smith B (2004) SNAP and SPAN:
Towards Dynamic Spatial Ontology. Spatial
Cognition and Computation 4;1:69-104
6. Baader F, Calvanese D, McGuinness D, Nardi D,
Patel-Schneider P. (2003) The Description Logic
Handbook. Cambridge University Press. ISBN:
7. Weed LI (1969) Medical records medical education,
and patient care. The problem-oriented record as a
basic tool. Cleveland, OH: Case Western Reserve
8. Ceusters W, Smith B. (2005) Strategies for referent
tracking in Electronic Health Records. Journal of
Biomedical Informatics (in press).
9. Tsarkov D, Horrocks I. (2006) FaCT++ Description
Logic Reasoner: System Description. Proc of Int.
Joint Conf. on Automated Reasoning (IJCAR~2006)
(in press).