Determining the Syntactic Structure of Medical Terms in Clinical Notes Ted Pedersen

Determining the Syntactic Structure
of Medical Terms in Clinical Notes
Bridget T. McInnes
Ted Pedersen
Serguei V. Pakhomov
bthomson@cs.umn.edu
Goal
The goal of this presentation is to present a
simple but effective approach to identifying the
syntactic structure of three-word terms.
Importance

Potentially improve the analysis of unrestricted
medical text

Mapping of medical text to standardized
terminologies

Unsupervised syntactic parsing
Syntactic Structure of Terms
Monolithic
w1 w2 w3
blue = independence
green = dependence
Non-branching
w1 w2 w3
Left-branching
w1 w2 w3
Right-branching
w1 w2 w3
Example
small bowel obstruction
Syntactic Structure of Example
Monolithic: [small bowel obstruction]
Non-branching: [small] [bowel] [obstruction]
Left-branching: [[small bowel] obstruction]
Right-branching: [small [bowel obstruction]]
Method used to determine the
structure of a term
The Log Likelihood Ratio is the ratio between
the observed probability of a term occurring and
the probability with which it would be expected to occur:
\[ \text{Log Likelihood Ratio} = \frac{\text{Probability of Term Occurring}}{\text{Expected Probability of Term}} \]
The expected probability of a term is often based
on the Non-branching (Independence) Model:
\[ \frac{P(\text{small bowel obstruction})}{P(\text{small})\,P(\text{bowel})\,P(\text{obstruction})} \]
Numerator: OBSERVED PROBABILITY
Denominator: EXPECTED PROBABILITY
Extended Log Likelihood Ratio
The expected probabilities can also be calculated
using two other hypotheses (models):
Non-branching: P(small) P(bowel) P(obstruction)
Left-branching: P(small bowel) P(obstruction)
Right-branching: P(small) P(bowel obstruction)
Three Log Likelihood Ratio
Equations
Non-branching
\[ \frac{P(\text{small bowel obstruction})}{P(\text{small})\,P(\text{bowel})\,P(\text{obstruction})} \]
Right-branching
\[ \frac{P(\text{small bowel obstruction})}{P(\text{small})\,P(\text{bowel obstruction})} \]
Left-branching
\[ \frac{P(\text{small bowel obstruction})}{P(\text{small bowel})\,P(\text{obstruction})} \]
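The three ratios above can be sketched in a few lines of Python. This is only an illustration of the observed-over-expected ratio as the slides describe it. The probabilities in `p` are invented toy values, not figures from the Mayo Clinic notes, and the names `expected_probability` and `ratios` are hypothetical.

```python
# Toy probabilities, invented for illustration only (not from the slides).
p = {
    "small": 0.002,
    "bowel": 0.001,
    "obstruction": 0.0015,
    "small bowel": 0.0004,
    "bowel obstruction": 0.0005,
    "small bowel obstruction": 0.0002,
}

MODELS = ("non-branching", "left-branching", "right-branching")


def expected_probability(model, w1, w2, w3, p):
    """Expected probability of the term w1 w2 w3 under one of the three models."""
    if model == "non-branching":      # all three words independent
        return p[w1] * p[w2] * p[w3]
    if model == "left-branching":     # (w1 w2) is a unit, w3 independent
        return p[f"{w1} {w2}"] * p[w3]
    if model == "right-branching":    # w1 independent, (w2 w3) is a unit
        return p[w1] * p[f"{w2} {w3}"]
    raise ValueError(f"unknown model: {model}")


observed = p["small bowel obstruction"]
ratios = {
    m: observed / expected_probability(m, "small", "bowel", "obstruction", p)
    for m in MODELS
}
for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f}")
```

Only the denominator changes from model to model; the observed probability in the numerator is the same in all three.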
Expected Probability
The expected probability of a term differs from
model to model, and so does the Log Likelihood Ratio
Non-branching: P(small) P(bowel) P(obstruction), LL = 11,635.45
Left-branching: P(small bowel) P(obstruction), LL = 5,169.81
Right-branching: P(small) P(bowel obstruction), LL = 8,532.90
Model Fitting
The model with the lowest Log Likelihood Ratio
best describes the underlying structure of the
term
Non-branching: P(small) P(bowel) P(obstruction), LL = 11,635.45
Left-branching: P(small bowel) P(obstruction), LL = 5,169.81
Right-branching: P(small) P(bowel obstruction), LL = 8,532.90
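The selection rule can be written as a single `min` over the three scores. The sketch below uses the LL values reported on this slide for "small bowel obstruction"; the variable names are hypothetical.

```python
# LL values as reported on the slide for "small bowel obstruction".
ll = {
    "non-branching": 11635.45,
    "left-branching": 5169.81,
    "right-branching": 8532.90,
}

# The model with the lowest Log Likelihood Ratio is taken as the structure.
best = min(ll, key=ll.get)
print(best)  # left-branching
```

For this example the rule selects the left-branching reading, in which "small bowel" is a unit that combines with "obstruction".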
ReCap



The Log Likelihood Ratio is calculated for each
possible model
    Non-branching
    Right-branching
    Left-branching
The probabilities for each model are obtained
from a corpus
The term is assigned the structure whose model
has the lowest Log Likelihood Ratio
Test Set
Contains 708 three-word terms from SNOMED-CT
Monolithic: 73 terms
Non-branching: 6 terms
Left-branching: 251 terms
Right-branching: 378 terms
Test Set (cont)


Syntactic structure of each term was
determined through the consensus of two
medical text index experts (kappa = 0.704)
The probabilities were obtained from over
10,000 Mayo Clinic clinical notes
Monolithic Results
[Bar chart: percentage agreement with human experts, by technique]
Left branching: 35.5
Right branching: 53.4
Our Method: 74.8
Results without Monolithic Terms
[Bar chart: percentage agreement with human experts, by technique]
Left branching: 39.5
Right branching: 59.5
Our Method: 83.5
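The left- and right-branching figures in both charts match what one gets by assigning every term the same structure and scoring against the test-set counts. The slides do not say how these baselines were computed, so this is our reading. The short check below reproduces the arithmetic; the function name `baseline` is hypothetical.

```python
# Test-set composition from the "Test Set" slide (708 terms in total).
counts = {
    "monolithic": 73,
    "non-branching": 6,
    "left-branching": 251,
    "right-branching": 378,
}


def baseline(structure, counts, exclude=()):
    """Percent agreement if every term were assigned `structure`,
    optionally leaving some structures out of the test set."""
    total = sum(n for s, n in counts.items() if s not in exclude)
    return round(100 * counts[structure] / total, 1)


# All 708 terms
print(baseline("left-branching", counts))    # 35.5
print(baseline("right-branching", counts))   # 53.4

# Without the 73 monolithic terms (635 terms)
print(baseline("left-branching", counts, exclude=("monolithic",)))   # 39.5
print(baseline("right-branching", counts, exclude=("monolithic",)))  # 59.5
```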
Limitations

Monolithic structures
    possibly identify through collocation extraction or
    dictionary lookup
As the number of words in a term grows, so
does the number of hypotheses (models) to be
evaluated
    only consider adjacent models
    limit the length of the terms to 5 or 6 words
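To give a feel for the growth, the sketch below enumerates every way of splitting a term into contiguous groups of adjacent words. Treating each such flat split as one hypothesis is an assumption on our part, one possible reading of "adjacent models"; the slides do not define the hypothesis space for longer terms, and nested bracketings would add more. The function name `segmentations` is hypothetical.

```python
from itertools import combinations


def segmentations(words):
    """All splits of `words` into contiguous, non-empty groups."""
    n = len(words)
    result = []
    for k in range(n):                            # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append(
                [tuple(words[a:b]) for a, b in zip(bounds, bounds[1:])]
            )
    return result


for split in segmentations(["small", "bowel", "obstruction"]):
    print(split)

for n in range(3, 7):
    print(n, "words:", len(segmentations(list(range(n)))), "splits")
```

For three words this yields four splits, which line up with the monolithic, left-branching, right-branching and non-branching structures. Under this assumption the count doubles with every added word, since each of the n - 1 gaps between words is either cut or not.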
Conclusions



Presented a simple but effective method to identify
the structure of three-word terms
The method uses the Log Likelihood Ratio
Could be extended to identify the structure of
four-, five- and six-word terms
Future Work

Improve accuracy of method
    explore other measures of association
        Chi-squared, Phi, Dice coefficient ...
    incorporate multiple measures together
Extend our method to four and five word terms
    difficulty: finding a test set
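As a small illustration of one of the measures named above, the Dice coefficient for a word pair is twice the pair count divided by the sum of the two individual word counts. The counts below are invented toy values, and comparing the two adjacent pairs is only one hypothetical way such a measure might be applied to a three-word term; the slides do not describe how it would be used.

```python
def dice(n_xy, n_x, n_y):
    """Dice coefficient: 2 * count(x y) / (count(x) + count(y))."""
    return 2 * n_xy / (n_x + n_y)


# Toy counts, invented for illustration only (not from the slides).
dice_small_bowel = dice(n_xy=40, n_x=100, n_y=60)        # "small bowel"
dice_bowel_obstruction = dice(n_xy=30, n_x=60, n_y=90)   # "bowel obstruction"

print(dice_small_bowel)        # 0.5
print(dice_bowel_obstruction)  # 0.4
```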
Thank you
Software:
Ngram Statistics Package (NSP)
www.d.umn.edu/~tpederse/nsp.html
Log Likelihood Ratio Models
www.cs.umn.edu/~bthomson/mti.html
Log Likelihood Equation
\[ LL = 2 \sum_{x,y,z} n_{xyz} \log\frac{n_{xyz}}{m_{xyz}} \]
Expected Values
Non-branching: \( m_{xyz} = n_{x++} \, n_{+y+} \, n_{++z} / n_{+++}^{2} \)
Left-branching: \( m_{xyz} = n_{xy+} \, n_{++z} / n_{+++} \)
Right-branching: \( m_{xyz} = n_{x++} \, n_{+yz} / n_{+++} \)
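The equation and expected values can be sketched in plain Python as follows. This is an illustrative sketch, not the NSP implementation: the 2x2x2 table layout (each position either matches the target word or does not), the function names, and the toy trigram list are all assumptions made for the example. The non-branching expected count divides by the square of the total, the standard three-way independence form, so that the expected counts sum to the total.

```python
import math
from collections import Counter
from itertools import product


def contingency_table(trigrams, target):
    """2x2x2 table of counts; index 1 means the position matches the target word."""
    table = Counter()
    for tri in trigrams:
        cell = tuple(int(w == t) for w, t in zip(tri, target))
        table[cell] += 1
    return table


def expected(table, model):
    """Expected counts m_xyz for every cell under the given model."""
    n = sum(table.values())

    def marg(x=None, y=None, z=None):
        # Marginal count; None plays the role of '+' in the slide notation.
        return sum(c for (a, b, d), c in table.items()
                   if (x is None or a == x)
                   and (y is None or b == y)
                   and (z is None or d == z))

    m = {}
    for x, y, z in product((0, 1), repeat=3):
        if model == "non-branching":
            m[x, y, z] = marg(x=x) * marg(y=y) * marg(z=z) / n ** 2
        elif model == "left-branching":
            m[x, y, z] = marg(x=x, y=y) * marg(z=z) / n
        elif model == "right-branching":
            m[x, y, z] = marg(x=x) * marg(y=y, z=z) / n
        else:
            raise ValueError(f"unknown model: {model}")
    return m


def log_likelihood(table, model):
    """LL = 2 * sum over cells of n * log(n / m); empty cells contribute 0."""
    m = expected(table, model)
    return 2 * sum(c * math.log(c / m[cell]) for cell, c in table.items() if c > 0)


# Toy corpus of trigrams, invented for illustration only.
trigrams = (
    [("small", "bowel", "obstruction")] * 5
    + [("small", "bowel", "resection")] * 3
    + [("large", "bowel", "obstruction")] * 2
    + [("chest", "pain", "today")] * 10
)
table = contingency_table(trigrams, ("small", "bowel", "obstruction"))
scores = {model: log_likelihood(table, model)
          for model in ("non-branching", "left-branching", "right-branching")}
for model, score in scores.items():
    print(f"{model}: {score:.2f}")
```

The structure assigned to the term would then be `min(scores, key=scores.get)`, following the model-fitting rule from the earlier slides.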