TRANSFORMING MEDICAL RECORDS TO A COMPUTER USABLE DATA BASE FORMAT
By W H Inmon
© copyright 2014 Forest Rim Technology, all rights reserved
Textual ETL is technology that reads raw, narrative text, such as that found in medical
records, and turns that text into a data base. A central part of the work that Textual ETL
does is the disambiguation of the text found in the medical record. The data base that is
produced holds the text in a state that is said to be “normalized”.
THE TEXTUAL ETL PROCESS
A medical record is read and processed by Textual ETL. Fig 1 shows the essential
Textual ETL process.
Textual ETL
The result of textual ETL is “normalized” text
Fig 1
The result of the disambiguation process done by Textual ETL is the “normalization” of
text. The text is produced in a linear manner in a data base. While the data base that is
produced by Textual ETL is usable, the linearity of the data makes it less than intuitive
to the neophyte. In order to make the data more usable and more intuitive, it is necessary
to restructure the data. Once restructured, the data is much more “friendly” to the person
needing to use it.
RESTRUCTURING THE DISAMBIGUATED, NORMALIZED DATA
Fig 2 shows that the data coming out of Textual ETL is restructured into a more intuitive
format.
restructured relational text
The restructured disambiguated medical record
Fig 2
The rows that are produced as a result of the restructuring look like those shown in
Fig 3.
Source byte superclass subclass category negation word/phrase patient
...
the contents of the restructured disambiguated record
Fig 3
At first glance the rows that are produced form a simple, flat file. Initially it is not
obvious that the rows contain anything important or interesting. But on closer
examination, the rows that are produced by the Textual ETL/restructuring process
closely reflect the narration found in the medical record.
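The row layout in Fig 3 can be pictured as a simple record type. The following is a minimal sketch only: the Python field names, the types, and the sample values are assumptions made for illustration and are not taken from the Textual ETL product itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RestructuredRow:
    """One row of the restructured data base, following the columns in Fig 3."""
    source: str                 # identification of the medical record (document)
    byte: int                   # position of the word/phrase within the source
    superclass: Optional[str]   # e.g. "Assessment"; None when the doctor gave none
    subclass: Optional[str]     # e.g. "Nose"; None when the doctor gave none
    category: Optional[str]     # taxonomical category; None when there is none
    negation: bool              # assumed here to be a simple yes/no flag
    word_phrase: str            # the word or phrase selected from the record
    patient: str                # patient identification


# Hypothetical sample row; every value is made up for illustration.
row = RestructuredRow(
    source="record_001",
    byte=1042,
    superclass="Treatment Plan",
    subclass=None,
    category="medication",
    negation=False,
    word_phrase="Zofran",
    patient="(withheld)",
)
```

Because every field sits in the same row, a program that reads one row has the full context of the word or phrase in hand.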
In order to see the relationship between the source medical record and the restructured
data base that has been created, consider the following.
EXTRACTING WORDS AND PHRASES FROM THE MEDICAL RECORD
Fig 4 shows that the medical record has been scanned and analyzed, and that certain
words and phrases have been selected from the medical record for inclusion into the data
base.
Source byte superclass subclass category negation word/phrase patient
the word/phrase is extracted from the source document
Fig 4
The word/phrase that has been selected for inclusion in the data base is the result of one
of many different types of processing done by Textual ETL. The processes that might be
responsible for the selection of the word or phrase include taxonomy resolution,
homograph resolution, acronym resolution, proximity resolution, stop word processing,
custom variable processing, and inline contextualization. There are other techniques for
selecting a word or phrase from the medical record as well. In every case, Textual ETL
has selected the word or phrase because it is important in the medical record and
needs to be available to the research analyst.
However and for whatever reason the word or phrase was selected from the medical record,
it is found in the row of data that has been extracted, as seen in the diagram.
PATIENT IDENTIFICATION
In the same row of data in the data base is found the patient identification, as seen in Fig
5. (Note: the patient identification has been blanked out here for the purposes of
preserving privacy.)
Source byte superclass subclass category negation word/phrase patient
the patient name is extracted from the medical record and is attached to every word/phrase
Fig 5
It is seen in the figure that the patient identifier has been located in the medical record.
The patient identifier is then attached to every row in the data base belonging to the
medical record. Because the patient identifier and the word or phrase of interest are
found in the same row, it is immediately obvious for whom the word or phrase was
written in the medical record.
NEGATION OF WORD OR PHRASE
Another important piece of data is the negation of the term. Fig 6 shows that occasionally
a term found in a medical record will be negated by the doctor writing the medical
report.
Source byte superclass subclass category negation word/phrase patient
if there is a negation of the term in the sentence preceding the word/phrase, negation is noted
Fig 6
Occasionally a doctor will say – “The patient does not have angina.” In this case there is
a negation of the term that is found in the medical record. It is very straightforward and
obvious when a term – a word or phrase – has been negated because the negation appears
in the same row of data as the word or phrase being negated.
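As a rough illustration of how a negation might be noted, the sketch below checks whether a negating word appears in the sentence ahead of the term. The cue list and the function name are hypothetical, and the approach is deliberately naive; it is not a description of how Textual ETL itself detects negation.

```python
import re

# Hypothetical list of negating words; a real system would need far more care.
NEGATION_CUES = {"no", "not", "denies", "without"}


def is_negated(sentence: str, term: str) -> bool:
    """Return True if a negation cue occurs in the sentence before the term."""
    lowered = sentence.lower()
    position = lowered.find(term.lower())
    if position == -1:
        # The term does not occur in this sentence at all.
        return False
    # Only the words that come before the term are inspected.
    preceding_words = re.findall(r"[a-z']+", lowered[:position])
    return any(word in NEGATION_CUES for word in preceding_words)


negated = is_negated("The patient does not have angina.", "angina")
affirmed = is_negated("The patient has angina.", "angina")
```

The result of such a check would be stored in the negation column of the row that holds the word or phrase, so the analyst never has to return to the source text to learn whether the term was negated.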
TAXONOMIC IDENTIFICATION OF A WORD OR PHRASE
Another important relationship of data is the taxonomical categorization of the word or
phrase. Fig 7 shows that the taxonomical categorization of the word or phrase is found in
the same row as the word or phrase.
Source byte superclass subclass category negation word/phrase patient
if the word/phrase belongs to a taxonomical category, the category is listed here
Fig 7
Not all words or phrases have a taxonomical categorization. If that is the case this column
of data will be blank. But if a word or phrase has a taxonomical categorization, this is
where it will be found. Also note that on occasion a word or phrase will have more than
one taxonomical categorization. If that is the case there will be more than one row of data
that has been created. There will be one row of data created for each taxonomical
categorization that applies to the word or phrase.
As a simple example of a taxonomical categorization, the word “Zofran” might be
classified as a medication.
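The rule of one row per taxonomical categorization can be sketched as follows. The taxonomy contents are hypothetical apart from the “Zofran” example above, and the dictionary-based lookup is an assumption made for illustration rather than the mechanism used by Textual ETL.

```python
# Hypothetical taxonomy: each word or phrase maps to zero or more categories.
TAXONOMY = {
    "zofran": ["medication"],
    "cold": ["symptom", "temperature"],  # made-up entry with two categories
}


def categorize(word_phrase: str) -> list[dict]:
    """Return one row per category; a single blank-category row if none apply."""
    categories = TAXONOMY.get(word_phrase.lower(), [])
    if not categories:
        # No taxonomical categorization: the category column is left blank.
        return [{"word_phrase": word_phrase, "category": ""}]
    return [{"word_phrase": word_phrase, "category": c} for c in categories]


zofran_rows = categorize("Zofran")
cold_rows = categorize("cold")
uncategorized_rows = categorize("yesterday")
```

In this sketch “Zofran” yields one row, the made-up entry “cold” yields two rows, and a word with no entry yields one row with a blank category.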
The taxonomical categorization may be arrived at by various means in Textual ETL. The
simplest and most common means is simple taxonomy resolution. But there are other
techniques by which taxonomical categorization is determined as well.
Taxonomical categorization is most helpful in the disambiguation of text.
SUBCLASSIFICATION IN THE MEDICAL RECORD
Another important piece of data found in the restructured data base is that of the
subclassification of text created by the doctor making the medical record. Fig 8 shows the
subclassification of data.
Source byte superclass subclass category negation word/phrase patient
if the doctor has created a subclass of text, the subclass is found here
Fig 8
A subclassification of data may be some topic such as “Nose”. The patient may have had
some notable condition that pertains to the nose. The doctor would simply create a
category of data for “Nose” and then start to make comments about the nose. If those
comments included the word or phrase that has been selected, then the subcategory
would appear in this part of the data base record.
SUPER CLASSIFICATION OF TEXT IN THE MEDICAL RECORD
In line with the doctor’s creation of subcategories is the occasional “super category” of
text that is created. Fig 9 shows the super category of text and where it is placed in the
data base record.
Source byte superclass subclass category negation word/phrase patient
if the doctor has created a superclass of text, the superclass is found here
Fig 9
A super class of categorization might be a doctor’s “Impression”, an “Assessment”,
or a “Treatment Plan”. The super categorization may or may not include one or more
subclassifications, depending on the doctor’s style in the creation of the medical record.
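One way to picture how a superclass and subclass come to be attached to text is to carry the most recent headings forward as the record is read. The sketch below assumes that headings sit on their own lines and come from known lists; the heading sets, the function name, and the sample lines are all hypothetical.

```python
# Hypothetical heading lists, using the examples named in the text.
SUPERCLASS_HEADINGS = {"impression", "assessment", "treatment plan"}
SUBCLASS_HEADINGS = {"nose"}


def label_lines(lines: list[str]) -> list[tuple[str, str, str]]:
    """Attach the current superclass and subclass to each line of narrative."""
    superclass = ""
    subclass = ""
    labeled = []
    for line in lines:
        text = line.strip()
        heading = text.rstrip(":")
        if heading.lower() in SUPERCLASS_HEADINGS:
            # A new superclass starts; any earlier subclass no longer applies.
            superclass, subclass = heading, ""
        elif heading.lower() in SUBCLASS_HEADINGS:
            subclass = heading
        elif text:
            labeled.append((superclass, subclass, text))
    return labeled


# Made-up sample lines, for illustration only.
sample = [
    "Assessment:",
    "Nose:",
    "Mild congestion noted.",
    "Treatment Plan:",
    "Zofran as needed.",
]
labeled_lines = label_lines(sample)
```

In this sketch the comment about congestion carries both “Assessment” and “Nose”, while the line under “Treatment Plan” has a superclass but a blank subclass.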
THE ORDER OF TEXT
Another important feature of the records of data that are created is the order in which the
doctor has created the record. Fig 10 shows that the sequence of terms created by the
doctor in the medical record is recorded and maintained.
Source byte superclass subclass category negation word/phrase patient
the sequencing of all word/phrases is found in the byte field
Fig 10
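Because the byte field records where each word or phrase sits in its source, the doctor’s original sequence can be recovered by sorting on it. This is a minimal sketch; the dictionary keys and the sample values are assumptions made for illustration.

```python
# Made-up rows, deliberately out of order.
rows = [
    {"source": "record_001", "byte": 310, "word_phrase": "Zofran"},
    {"source": "record_001", "byte": 42, "word_phrase": "angina"},
    {"source": "record_001", "byte": 118, "word_phrase": "nose"},
]


def in_document_order(rows: list[dict]) -> list[dict]:
    """Sort rows by source document, then by byte position within it."""
    return sorted(rows, key=lambda r: (r["source"], r["byte"]))


ordered_terms = [r["word_phrase"] for r in in_document_order(rows)]
```

Sorting on the source first keeps rows from different medical records from being interleaved.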
IDENTIFYING THE MEDICAL RECORD
And a final important piece of information is that of the identification of the medical
record itself. Fig 11 shows that the identification of the medical record is retained for all
the entries in the data base.
Source byte superclass subclass category negation word/phrase patient
the identification of the document is found here
Fig 11
It is seen then that there is a very close correlation between the important elements of the
medical record and the restructured data base that has been created. All of the
information needed by the research analyst is found in the same record. No searching
and no “look ups” are required of the research analyst. Because all the pertinent data is
included in a single row of the restructured, disambiguated data, the processing required
of the computer analyst is as straightforward as it can get. Processing a data base
analytically does not get any easier than reading a single record and processing it.
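The point about single-record processing can be illustrated with a query that needs nothing beyond the rows themselves. The sketch below finds the patients for whom a term was mentioned and not negated; the row keys, the function name, and the sample data are hypothetical.

```python
# Made-up rows; each row already carries patient, term, and negation together.
rows = [
    {"patient": "P1", "word_phrase": "angina", "negation": True},
    {"patient": "P2", "word_phrase": "angina", "negation": False},
    {"patient": "P2", "word_phrase": "Zofran", "negation": False},
]


def patients_with(rows: list[dict], term: str) -> set[str]:
    """Return patients with at least one non-negated mention of the term."""
    return {
        r["patient"]
        for r in rows
        if r["word_phrase"].lower() == term.lower() and not r["negation"]
    }


angina_patients = patients_with(rows, "angina")
```

One pass over the rows answers the question, with no join to a patient table and no return to the source text.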
ALL PERTINENT INFORMATION IN A SINGLE RECORD
Fig 12 shows that all the pertinent information needed for analysis that comes from the
medical record is found in the record itself.
Source byte superclass subclass category negation word/phrase patient
there is an ENORMOUS amount of context found in a single row of data in the disambiguated restructured data base
Fig 12
Another way of thinking about the relationship between the medical record and the
restructured data base is that the restructured data base is a mirror image of the medical
record, rendered in a form that the computer can use.
Fig 13 shows this relationship.
medical record
restructured data base
Fig 13
Bill Inmon is the founder of Forest Rim Technology, located in Castle Rock, Colorado.
Forest Rim Technology produces Textual ETL and the data base that can be restructured
from Textual ETL. With Textual ETL you can turn document oriented data into an
analytical data base that can be analyzed by the computer analyst.