Adapting an Algorithm to a Corpus
Peter Nelson
Carleton College
J. Starren, M.D., Ph.D.
L. Rasmussen
Project Purpose
In the context of a GWAS on hypothyroidism
A particular natural language processing algorithm used to identify contextual features
Discover and evaluate automatic and semi-automatic methods of adapting that algorithm to a corpus of medical records
Project Motivation
PMRP
eMERGE
Hypothyroidism GWAS
– Phenotyping
Project Motivation - PMRP
Marshfield Clinic PMRP
– ~ 20,000 people from central WI
– EHR and blood samples
– Studies in the fields of:
    – Population Genetics
    – Genetic Epidemiology
    – Pharmacogenetics
– Leverage genetic data to improve care
Project Motivation - eMERGE
eMERGE Network
Organized by NHGRI
Members
– Marshfield Clinic
– Vanderbilt
– Northwestern
– Mayo Clinic
– Group Health Cooperative
Genome Wide Association Studies
What is a GWAS? Why Do One?
“[A GWAS] involves rapidly scanning markers across the… genomes of many people to find genetic variations associated with a particular disease.”
“[R]esearchers can use the information to develop better strategies to detect, treat and prevent the disease.”
“…common, complex diseases, such as asthma, cancer, diabetes….”
NHGRI website (http://www.genome.gov/20019523)
Hypothyroidism GWAS
Insufficient hormone production by the thyroid gland can cause fatigue, weight gain, and other symptoms.
Diagnosable and treatable
About 3% of the American population has the clinical condition
Different causes
Hypothyroidism GWAS
eMERGE Study
– Identify patients with presumptive Hashimoto’s disease induced hypothyroidism (Cases)
– Identify patients with normal thyroid function (Controls)
– Genotype cases and controls (by testing for 100,000s of SNPs)
– Genome-wide association analysis
Phenotyping in a GWAS
Doctors design an algorithm for phenotyping based on the presence or absence of key procedures, medicines, and conditions in a patient’s medical history
EHR is used as a resource
– Coded fields
– Unmarked text
– Images
Manual vs Electronic Phenotyping
Manual phenotyping by chart abstractors
– Accurate (Gold standard)
– Far too expensive (~20,000 medical records to process)
Electronic phenotyping by computers
– Methods
    – Query database of coded fields
    – Natural language processing on free text
    – OCR and Image Processing on other resources
– Comparatively cheap
– Sample must be validated by chart abstractors
Natural Language Processing
What is it?
What problems must be solved?
How can they be solved?
Natural Language Processing
Search for concepts in free text of EHR
Simple keyword search insufficient
– “There was no evidence of polyps or ulceration.”
– “Rule out H. pylori, gastritis and gastropathy.”
– “She should return to the Emergency Department if she experiences nausea or vomiting.”
– “Patient should avoid any tests which involve the use of iodinated contrast material”
– “The indication for this procedure is family history of colon cancer.”
Natural Language Processing
Search for concepts in free text of EHR
The same five example sentences, viewed for three contextual features:
– Negated
– Hypothetical
– Family History
NegEx
Simple
Performs well
– Against gold standard
– Against MedLEE
– Against straight statistical methods
Recently extended
– Hypothetical & Family History
– “ConText”
NegEx
“There was no evidence of polyps or ulceration.”
[Scope marker: dotted line running rightward under the sentence and ending in a boundary bar]
“Rule out H. pylori, gastritis, and gastropathy.”
[Scope marker: dotted line running rightward under the sentence and ending in a boundary bar]
“Quantitative PCR testing for BK Virus is negative.”
[Scope marker: dotted line starting from a boundary bar at the left of the sentence]
NegEx
“No evidence of spread of cancer to the lungs.”
“No residua of healed fractures can be seen otherwise.”
[Scope markers: a dotted line ending in a boundary bar was drawn over each sentence]
NegEx
NegEx, and therefore ConText, require carefully tuned lists of triggers and pseudotriggers.
How big must a list be to perform well?
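To make the role of the two lists concrete, here is a minimal sketch of a NegEx-style check. It is not the published NegEx implementation: the three lists are tiny hypothetical samples, the scope simply runs to the sentence boundary (as the scope markers on the earlier slides suggest), and the function names are invented for illustration.

```python
import re

# Hypothetical, deliberately tiny lists; real NegEx lists are far longer.
PRE_TRIGGERS = ["no", "denies", "denied", "not", "without", "rule out"]
POST_TRIGGERS = ["is negative", "resolved"]
PSEUDOTRIGGERS = ["no change", "no increase", "without difficulty"]


def find_spans(phrase, text):
    """Return (start, end) spans of whole-word matches of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return [m.span() for m in re.finditer(pattern, text)]


def is_negated(sentence, concept):
    """Return True if concept lies inside the scope of a usable trigger."""
    text = sentence.lower()
    concept_spans = find_spans(concept.lower(), text)
    if not concept_spans:
        return False
    # A trigger that sits inside a pseudotrigger does not open a scope.
    blocked = [s for p in PSEUDOTRIGGERS for s in find_spans(p, text)]

    def usable(span):
        return not any(b[0] <= span[0] and span[1] <= b[1] for b in blocked)

    for trigger in PRE_TRIGGERS:
        for span in find_spans(trigger, text):
            # Forward scope: everything after the trigger (simplification).
            if usable(span) and any(c[0] >= span[1] for c in concept_spans):
                return True
    for trigger in POST_TRIGGERS:
        for span in find_spans(trigger, text):
            # Backward scope: everything before the trigger (simplification).
            if usable(span) and any(c[1] <= span[0] for c in concept_spans):
                return True
    return False
```

With these sample lists, `is_negated("There was no evidence of polyps or ulceration.", "ulceration")` is true, while "No change in the polyps." leaves "polyps" unnegated because "no change" is treated as a pseudotrigger.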
Scenarios
Annotated training set used to populate lists
Large unmarked training set used to extend existing lists
Using Annotated Data
NegEx/ConText creators provide annotated excerpts from medical records
Look for associations between words and negation to populate list of triggers
Look for associations between words near triggers and false positives to populate list of pseudotriggers
Identifying Triggers
Create a confusion matrix for each word
Sort words by some statistic based on these confusion matrices
Select or reject top candidate as a trigger
Repeat on yet unexplained sentences until stopping condition met

Confusion matrix for a word:
                               Actual Classification
                               +        -
Predicted Classification   +   TP       FP
                           -   FN       TN
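The four steps above can be sketched as a greedy loop. This is an illustration under stated assumptions, not the project's code: the input format (pairs of sentence text and a negation label), the automatic accept rule, and the `min_score` and `max_triggers` stopping parameters are all hypothetical, and the slides describe a manual select-or-reject step instead.

```python
import re
from collections import Counter


def confusion_matrices(sentences):
    """Map each word to (tp, fp, fn, tn), treating 'word present' as a
    prediction that the sentence is negated."""
    with_pos, with_neg = Counter(), Counter()
    total_pos = sum(1 for _, negated in sentences if negated)
    total_neg = len(sentences) - total_pos
    for text, negated in sentences:
        for word in set(re.findall(r"[a-z']+", text.lower())):
            (with_pos if negated else with_neg)[word] += 1
    matrices = {}
    for word in set(with_pos) | set(with_neg):
        tp, fp = with_pos[word], with_neg[word]
        matrices[word] = (tp, fp, total_pos - tp, total_neg - fp)
    return matrices


def f_measure(tp, fp, fn, tn):
    """Balanced F-measure computed directly from the confusion matrix."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def select_triggers(sentences, score=f_measure, min_score=0.3, max_triggers=10):
    """Greedily pick the best-scoring word, then rescore what is left."""
    triggers = []
    remaining = list(sentences)
    while remaining and len(triggers) < max_triggers:
        matrices = confusion_matrices(remaining)
        if not matrices:
            break
        best = max(sorted(matrices), key=lambda w: score(*matrices[w]))
        if score(*matrices[best]) < min_score:  # hypothetical stopping rule
            break
        triggers.append(best)
        # Keep only the sentences the new trigger does not explain.
        remaining = [
            (text, negated) for text, negated in remaining
            if best not in re.findall(r"[a-z']+", text.lower())
        ]
    return triggers
```

Any function of `(tp, fp, fn, tn)` can be passed as `score`, so the ranking statistic can be swapped without touching the loop.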
Identifying Triggers
Statistical measures used
– Log-likelihood ratio
– Precision (PPV)
– Recall (Sensitivity)
– F-measure
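As a reference for the four measures, here is a sketch that computes each from one confusion matrix. Two assumptions are made: the F-measure is the balanced F1 (which agrees with the reported figures, e.g. precision 100.0 and recall 50.0 give 66.7), and the log-likelihood ratio is the standard G² statistic for a 2x2 table, G² = 2 Σ O ln(O/E). The slides do not spell out the exact LLR formula used.

```python
import math


def precision(tp, fp, fn, tn):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fp, fn, tn):
    """Sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall (balanced F1, assumed)."""
    p, r = precision(tp, fp, fn, tn), recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r) if p + r else 0.0


def log_likelihood_ratio(tp, fp, fn, tn):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells, where E is the
    count expected if word and label were independent."""
    n = tp + fp + fn + tn
    rows = (tp + fp, fn + tn)   # predicted +, predicted -
    cols = (tp + fn, fp + tn)   # actual +, actual -
    observed = ((tp, fp), (fn, tn))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o:  # empty cells contribute nothing
                total += o * math.log(o * n / (rows[i] * cols[j]))
    return 2 * total
```

Note that G² measures the strength of association only. A word that strongly predicts the absence of negation scores as high as one that predicts negation, so a ranking based on it may need a direction check (for example, requiring precision above the base rate).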
Log-Likelihood Ratio
Top-ranked candidate word at each step:

Candidate    LLR      Precision (%)   Recall (%)   F-measure (%)
no           1763.2   95.3            69.9         80.6
denies       617.8    100.0           50.0         66.7
not          179.4    70.6            32.4         44.4
denied       187.6    100.0           34.0         50.7
without      79.9     60.0            27.3         37.5
negative     77.7     100.0           25.0         40.0
resolved     61.3     83.3            27.8         41.7
4-way tie!   -        -               -            -

Totals for the trigger list as it grows:

Triggers                                                         LLR      Precision (%)   Recall (%)   F-measure (%)
{ }                                                              0.0      0.0             0.0          0.0
{ no }                                                           1763.2   95.3            69.9         80.6
{ no, denies }                                                   2371.6   96.1            84.9         90.2
{ no, denies, not }                                              2519.5   94.2            89.8         92.0
{ no, denies, not, denied }                                      2704.2   94.4            93.3         93.9
{ no, denies, not, denied, without }                             2763.2   93.4            95.1         94.2
{ no, denies, not, denied, without, negative }                   2839.7   93.5            96.3         94.9
{ no, denies, not, denied, without, negative, resolved (post) }  2900.0   93.4            97.4         95.3
Other Measures
Precision (PPV)
– 271 tie for 100%
– Poor metric
Recall (sensitivity)
– Catches all the same ones as LLR
– Also finds “any”, “the”, and “for”
– Imprecise metric
F-measure
– Identical results to LLR
– Good metric
Identifying Pseudotriggers
Use analogous method to find words that predict false-positives
Limit to words next to triggers
Filter out prospects with low precision
Sort by LLR
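A sketch of the pseudotrigger search described above, under several assumptions: the input is a list of (sentence, is_negated) pairs, a false positive is a sentence that contains a trigger but is not negated, candidates are two-word phrases formed by a trigger and an adjacent word, and LLR is taken to be the G² statistic. The trigger subset and the `min_precision` cutoff are hypothetical.

```python
import math
import re
from collections import Counter

TRIGGERS = {"no", "not", "without"}  # hypothetical subset of a trigger list


def g2(tp, fp, fn, tn):
    """G^2 log-likelihood ratio for a 2x2 table (assumed LLR form)."""
    n = tp + fp + fn + tn
    cells = ((tp, tp + fp, tp + fn), (fp, tp + fp, fp + tn),
             (fn, fn + tn, tp + fn), (tn, fn + tn, fp + tn))
    return 2 * sum(o * math.log(o * n / (r * c)) for o, r, c in cells if o)


def candidate_phrases(text):
    """Two-word phrases made of a trigger and the word right next to it."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases = set()
    for i, word in enumerate(words):
        if word in TRIGGERS:
            if i + 1 < len(words):
                phrases.add(f"{word} {words[i + 1]}")
            if i > 0:
                phrases.add(f"{words[i - 1]} {word}")
    return phrases


def rank_pseudotriggers(sentences, min_precision=0.8):
    """Rank phrases by how strongly they predict a false positive."""
    fired = []
    for text, negated in sentences:
        phrases = candidate_phrases(text)
        if phrases:  # only sentences in which some trigger fires
            fired.append((phrases, not negated))
    total_fp = sum(1 for _, is_fp in fired if is_fp)
    total_ok = len(fired) - total_fp
    in_fp, in_ok = Counter(), Counter()
    for phrases, is_fp in fired:
        for phrase in phrases:
            (in_fp if is_fp else in_ok)[phrase] += 1
    ranked = []
    for phrase in set(in_fp) | set(in_ok):
        tp, fp = in_fp[phrase], in_ok[phrase]
        if tp and tp / (tp + fp) >= min_precision:  # drop low precision
            score = g2(tp, fp, total_fp - tp, total_ok - fp)
            ranked.append((score, phrase))
    return [phrase for _, phrase in sorted(ranked, key=lambda x: (-x[0], x[1]))]
```

On a small hand-made sample in which "No change ..." sentences are annotated as not negated, "no change" rises to the top of the ranking, while phrases such as "no evidence" that accompany true negations are filtered out.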
Identifying Pseudotriggers
Some real pseudotriggers
– “no residua”
– “without difficulty”
Some that should be considered for addition to the list of pseudotriggers
– “not know”
– “no additional”
Some entirely anomalous pseudotriggers
– “no hepatosplenomegaly”
Further Work
Formalize stopping condition
Try other statistical measures
Can potential pseudotriggers be further explored using unannotated EHR?
Evaluate the finished algorithm on ConText data
Using Unmarked Data
Many pseudotriggers are variations on other pseudotriggers
– “No change”
– “No significant change”
– “No increase”
Could a large unmarked corpus of EHR be searched for variations on pseudotriggers?
Phrase Comparison Methods
Edit Distance
N-gram similarity, Set similarity
Vector based methods
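Two of the phrase comparison methods listed above are easy to sketch: Levenshtein edit distance and a set similarity (Jaccard) over tokens or character n-grams. These are textbook definitions; the slides do not say which variants were actually tested, and the function names here are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Comparing the pseudotrigger variants from the previous slide at the token level, `edit_distance("no change".split(), "no significant change".split())` is 1 (one inserted word), and the Jaccard similarity of the two token sets is 2/3.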
Word Comparison Methods
Path-based methods
– Path
– Wu-Palmer
– Leacock-Chodorow
Path-based, with IC
– Resnik
– Jiang-Conrath
– Lin
Gloss-based
– Lesk (and Lesk Extended)
– Gloss-Vector
– LSA
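To show what the path-based family computes, here is a sketch of the Path and Wu-Palmer measures on a toy is-a taxonomy. The taxonomy is hypothetical and stands in for WordNet; depth is counted with the root at depth 1, which is one common convention.

```python
# Hypothetical is-a taxonomy (child -> parent); the root has parent None.
PARENT = {
    "entity": None,
    "change": "entity",
    "increase": "change",
    "decrease": "change",
    "finding": "entity",
    "fracture": "finding",
}


def ancestors(word):
    """Path from word up to the root, starting with word itself."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def depth(word):
    """Number of nodes from the root down to word (root has depth 1)."""
    return len(ancestors(word))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both words."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1 / (1 + edges)


def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, "increase" and "decrease" share the parent "change", giving a path similarity of 1/3 and a Wu-Palmer score of 2/3, while "increase" and "fracture" meet only at the root and score 1/5 and 1/3.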
Preliminary Results
Edit distance seems to be a poor phrase comparison metric
Path-based measures seem to be poor word comparison metrics
Further Work
Explore gloss-based measures of word similarity
Explore measures of phrase similarity other than edit distance
Evaluate the finished metric on ConText lists
Validation
Take algorithms developed on NegEx and apply them to ConText
Have chart abstractors evaluate terms from some documents in the hypothyroidism GWAS. Compare performance of unmodified ConText with that of extended version(s)
Results/Conclusion
The study is ongoing; no final results are available
The methods described in this presentation show promise, but they must be validated before any conclusions can be drawn
If the phrase comparison metric performs well, it could potentially be used to solve smoothing problems in n-gram models.
N-gram Interlude
N-gram models estimate probability based on leading context:
– “Class, please hand your homework ___”
– “I heard a sharp rap on the ___”
Many applications
– Machine translation
– OCR, speech recognition, spell checking
– Identifying pathogenicity islands in virus and bacteria genomes, predicting protein folding
N-gram Interlude
As size of the n-grams (i.e., n) increases
– Performance improves
– Number of parameters increases exponentially
– Size of data set necessary to accurately estimate parameters becomes impossibly large
– Missing parameters must be estimated based on existing ones (Smoothing)
Could smoothing be based on a phrase similarity metric?
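For readers unfamiliar with smoothing, here is a minimal bigram model with add-k (Laplace) smoothing, the simplest standard way to give unseen word pairs a nonzero probability. It is shown only as a baseline: it is not the similarity-based smoothing the slide asks about, and the function names and toy corpus are illustrative.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count bigrams, with <s> and </s> marking the sentence boundaries."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words[1:])  # every token that can follow a context
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def probability(word, prev, model, k=1.0):
    """P(word | prev) with add-k smoothing over the vocabulary."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))
```

With a vocabulary of V words, a bigram model already has on the order of V² parameters and an n-gram model V^n, which is why most of them are never observed. Add-k spreads probability mass uniformly over unseen pairs; a similarity-based scheme would instead borrow mass from contexts judged similar to the unseen one.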
Bibliography

Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying
negated findings and diseases in discharge summaries. J Biomed Inform 2001;34(5):301–10.

Chu D, Dowling JN, Chapman WW. Evaluating the effectiveness of four contextual features in classifying
annotated clinical conditions in emergency department reports. AMIA Annu Symp Proc 2006:141–5.

Goryachev S, Kim H, Zeng-Treitler Q. 2008. Identification and extraction of family history information from
clinical reports. In proceedings of AMIA Annu Symp Proc. 2008 Nov 6:247-51.

Goryachev S, Sordo M, Zeng QT, and Ngo L. 2006. Implementation and evaluation of four different
methods of negation detection. Technical report, DSG.

Harkema H et al. ConText: An algorithm for determining negation, experiencer, and temporal status from
clinical reports. J Biomed Inform (2009), doi:10.1016/j.jbi.2009.05.002

Pedersen, Ted. 1996. Fishing for exactness. In Proceedings of the South-Central SAS Users Group
Conference, pages 188--200, Austin, TX.

Xu H, Anderson K, Grann VR, Friedman C. Facilitating cancer research using natural language processing
of pathology reports. Medinfo 2004;2004:565-72.
Acknowledgements
Luke Rasmussen
Laura Coleman & Ruth Zetek
Justin Starren
MCRF
Donors
Creators and Maintainers of
– NegEx/ConText : W. Chapman, H. Harkema, X. Shen, P. Kang
– NLTK : S. Bird, E. Klein, E. Loper, et al.
– WordNet Similarity : Ted Pedersen, et al.