TRANSFORMING MEDICAL RECORDS TO A COMPUTER USABLE DATA BASE FORMAT

By W H Inmon

© copyright 2014 Forest Rim Technology, all rights reserved

Textual ETL is technology that reads raw, narrative text, such as that found in medical records, and turns that text into a data base. A central part of the work that Textual ETL does is the disambiguation of the text found in the medical record. The data base that is produced holds the text in a state that is said to be “normalized”.

THE TEXTUAL ETL PROCESS

A medical record is read and processed by Textual ETL. Fig 1 shows the essential Textual ETL process.

Fig 1: The result of Textual ETL is “normalized” text.

The result of the disambiguation process done by Textual ETL is the “normalization” of text. The text is written to the data base in a linear manner. While the data base produced by Textual ETL is usable, the linearity of its data makes it less than intuitive to the neophyte. To make the data more usable and more intuitive, it is necessary to restructure it. Once restructured, the data is much more “friendly” to the person who needs to use it.

RESTRUCTURING THE DISAMBIGUATED, NORMALIZED DATA

Fig 2 shows that the data coming out of Textual ETL is restructured into a more intuitive format.

Fig 2: The restructured disambiguated medical record (restructured relational text).

The restructured rows that are produced look like those seen in Fig 3.
Fig 3: The contents of the restructured disambiguated record. Every row has the same columns: source, byte, superclass, subclass, category, negation, word/phrase, patient.

At first glance the rows that are produced are a simple flat file, and it is not obvious that they contain anything important or interesting. On closer examination, however, the rows produced by the Textual ETL/restructuring process closely reflect the narration found in the medical record. To see the relationship between the source medical record and the restructured data base, consider the following.

EXTRACTING WORDS AND PHRASES FROM THE MEDICAL RECORD

Fig 4 shows that the medical record has been scanned and analyzed, and that certain words and phrases have been selected from it for inclusion in the data base.

Fig 4: The word/phrase is extracted from the source document.

The word or phrase selected for inclusion in the data base is the result of one of many different types of processing done by Textual ETL. The processes that might be responsible for the selection include taxonomy resolution, homograph resolution, and acronym resolution. The word or phrase might also have been selected because of proximity resolution, stop word processing, custom variable processing, or inline contextualization.
Other techniques can also select a word or phrase found in the medical record. In every case, Textual ETL has selected the word or phrase because it is important in the medical record and needs to be available to the research analyst.

However and for whatever reason the word or phrase was selected from the medical record, it is found in the row of data that has been extracted, as seen in Fig 4.

PATIENT IDENTIFICATION

In the same row of data is found the patient identification, as seen in Fig 5. (Note: the patient identification has been blanked out here to preserve privacy.)

Fig 5: The patient name is extracted from the medical record and is attached to every word/phrase.

The figure shows that the patient identifier has been located in the medical record. The patient identifier is then attached to every row in the data base belonging to that medical record. Because the patient identifier and the word or phrase of interest are found in the same row, it is immediately obvious for whom the word or phrase was written.

NEGATION OF WORD OR PHRASE

Another important piece of data is the negation of the term. Fig 6 shows that occasionally a term found in a medical record will be negated by the doctor writing the medical report.

Fig 6: If there is a negation of the term in the sentence preceding the word/phrase, negation is noted.

Occasionally a doctor will say, “The patient does not have angina.” In this case there is a negation of the term that is found in the medical record.
It is obvious when a term (a word or phrase) has been negated, because the negation appears in the same row of data as the word or phrase being negated.

TAXONOMIC IDENTIFICATION OF A WORD OR PHRASE

Another important relationship is the taxonomical categorization of the word or phrase. Fig 7 shows that the taxonomical categorization is found in the same row as the word or phrase.

Fig 7: If the word/phrase belongs to a taxonomical category, the category is listed here (category column).

Not all words or phrases have a taxonomical categorization. When there is none, this column will be blank. When there is one, this is where it will be found.

On occasion a word or phrase will have more than one taxonomical categorization. In that case more than one row of data is created: one row for each taxonomical categorization that applies to the word or phrase.

As a simple example of a taxonomical categorization, the word “Zofran” might be classified as a medication.

The taxonomical categorization may be arrived at by various means in Textual ETL. The simplest and most common is simple taxonomy resolution, but there are other techniques as well. Taxonomical categorization is most helpful in the disambiguation of text.

SUBCLASSIFICATION IN THE MEDICAL RECORD

Another important piece of data found in the restructured data base is the subclassification of text created by the doctor making the medical record. Fig 8 shows the subclassification of data.
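The negation and taxonomy columns described in the two preceding sections can be illustrated with a minimal Python sketch. It is not Forest Rim's implementation: the `TAXONOMY` table, the `NEGATION_CUES` list, the `Row` type, and the `rows_for_term` function are hypothetical names introduced here, the categories given for "angina" are invented for illustration, and the negation check is a naive scan of the words that precede the term in the same sentence. The sketch only shows the shape of the output: one row per taxonomical category, a blank category when none applies, and the negation recorded in the same row as the term.

```python
from dataclasses import dataclass

# Hypothetical taxonomy (term -> categories). "Zofran" as a medication is the
# example from the text; the entries for "angina" are invented for illustration.
TAXONOMY = {
    "zofran": ["medication"],
    "angina": ["symptom", "cardiac condition"],
}

# Hypothetical list of negating words; real negation handling is more involved.
NEGATION_CUES = ("not", "no", "denies", "without")


@dataclass(frozen=True)
class Row:
    patient: str
    word_phrase: str
    negation: str  # the negating word, or "" when the term is not negated
    category: str  # "" when the term has no taxonomical category


def rows_for_term(patient, sentence, term):
    """Build the rows for one single-word term found in one sentence."""
    words = sentence.lower().replace(".", " ").replace(",", " ").split()
    key = term.lower()
    if key not in words:
        return []
    # Look for a negating word earlier in the sentence than the term.
    preceding = words[: words.index(key)]
    negation = next((w for w in preceding if w in NEGATION_CUES), "")
    # One row per category; a single row with a blank category if there is none.
    categories = TAXONOMY.get(key, [""])
    return [Row(patient, term, negation, c) for c in categories]


if __name__ == "__main__":
    for row in rows_for_term("P-001", "The patient does not have angina.", "angina"):
        print(row)
```

Putting the negation and the category in the row itself, rather than in separate tables, is what lets an analyst read a single row and know whether the term was negated and how it was classified.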
Fig 8: If the doctor has created a subclass of text, the subclass is found here (subclass column).

A subclassification of data may be a topic such as “Nose”. The patient may have had a notable condition that pertains to the nose. The doctor would simply create a category of data for “Nose” and then start to make comments about the nose. If those comments include the word or phrase that has been selected, the subclass appears in this part of the data base record.

SUPER CLASSIFICATION OF TEXT IN THE MEDICAL RECORD

In line with the doctor’s creation of subclasses is the occasional “super category” of text. Fig 9 shows the super category of text and where it is placed in the data base record.

Fig 9: If the doctor has created a superclass of text, the superclass is found here (superclass column).

A superclass might be a doctor’s “Impression”, “Assessment”, or “Treatment Plan”. The superclass may or may not include one or more subclassifications, depending on the doctor’s style in creating the medical record.

THE ORDER OF TEXT

Another important feature of the records that are created is the order in which the doctor wrote the medical record. Fig 10 shows that the sequence of terms created by the doctor is recorded and maintained.

Fig 10: The sequencing of all word/phrases is found in the byte field.

IDENTIFYING THE MEDICAL RECORD

A final important piece of information is the identification of the medical record itself. Fig 11 shows that the identification of the medical record is retained for all the entries in the data base.
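The columns described so far can be pictured as one flat record type. The Python sketch below is an assumption about layout, not the product's actual schema: the field names follow the column labels in the figures, while the sample rows, the `record-17.txt` and `P-001` identifiers, and the `in_document_order` and `find` helpers are hypothetical. It shows how the byte column recovers the doctor's original sequence and how terms can be selected by section, category, or negation without consulting any other table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    source: str       # identification of the medical record
    byte: int         # position of the word/phrase in the record
    superclass: str   # e.g. "Assessment"; "" if the doctor created none
    subclass: str     # e.g. "Nose"; "" if the doctor created none
    category: str     # taxonomical category; "" if none applies
    negation: str     # negating word; "" if the term is not negated
    word_phrase: str
    patient: str


# Hypothetical sample rows, deliberately stored out of document order.
ROWS = [
    Row("record-17.txt", 412, "Assessment", "", "symptom", "not", "angina", "P-001"),
    Row("record-17.txt", 95, "", "Nose", "", "", "congestion", "P-001"),
    Row("record-17.txt", 530, "Treatment Plan", "", "medication", "", "Zofran", "P-001"),
]


def in_document_order(rows):
    """Sort rows by record and then by byte position within the record."""
    return sorted(rows, key=lambda r: (r.source, r.byte))


def find(rows, superclass=None, category=None, negated=None):
    """Filter rows using only values carried in each row (no look ups)."""
    selected = []
    for r in rows:
        if superclass is not None and r.superclass != superclass:
            continue
        if category is not None and r.category != category:
            continue
        if negated is not None and bool(r.negation) != negated:
            continue
        selected.append(r)
    return selected


if __name__ == "__main__":
    for row in in_document_order(ROWS):
        print(row.byte, row.word_phrase)
    print(find(ROWS, category="medication", negated=False))
```

Each filter in `find` reads a value already present in the row, which is the practical meaning of having the patient, the section, the category, and the negation travel with every word or phrase.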
Fig 11: The identification of the document is found here (source column).

It is seen, then, that there is a very close correlation between the important elements of the medical record and the restructured data base that has been created. All of the information needed by the research analyst is found in the same record. No searching is needed and no “look ups” are required, because all the pertinent data is held in that record.

Because all the pertinent data is included in a single row of the restructured, disambiguated data, the processing required of the computer analyst is as straightforward as it can get. Analytical processing of a data base does not get any easier than reading a single record and processing it.

ALL PERTINENT INFORMATION IN A SINGLE RECORD

Fig 12 shows that all the pertinent information needed for analysis that comes from the medical record is found in the record itself.

Fig 12: There is an ENORMOUS amount of context found in a single row of data in the disambiguated restructured data base.

Another way of thinking about the relationship between the medical record and the restructured data base is that the restructured data base is a mirror image of the medical record, rendered in a form the computer can use. Fig 13 shows this relationship.

Fig 13: The medical record and the restructured data base.

Bill Inmon is the founder of Forest Rim Technology, located in Castle Rock, Colorado. Forest Rim Technology produces Textual ETL and the data base that can be restructured from Textual ETL. With Textual ETL you can turn document-oriented data into an analytical data base that can be analyzed by the computer analyst.