Adapting an Algorithm to a Corpus
Peter Nelson, Carleton College
J. Starren, M.D., Ph.D.
L. Rasmussen

Project Purpose
- In the context of a GWAS on hypothyroidism, and of a particular natural language processing algorithm used to identify contextual features:
  - Discover and evaluate automatic and semi-automatic methods of adapting that algorithm to a corpus of medical records

Project Motivation
- PMRP
- eMERGE
- Hypothyroidism GWAS
- Phenotyping

Project Motivation: PMRP
- Marshfield Clinic PMRP
  - ~20,000 people from central Wisconsin
  - EHR and blood samples
  - Studies in the fields of:
    - Population genetics
    - Genetic epidemiology
    - Pharmacogenetics
- Leverage genetic data to improve care

Project Motivation: eMERGE
- eMERGE Network
- Organized by NHGRI
- Members:
  - Marshfield Clinic
  - Vanderbilt
  - Northwestern
  - Mayo Clinic
  - Group Health Cooperative

Genome-Wide Association Studies
- What is a GWAS? “[A GWAS] involves rapidly scanning markers across the… genomes of many people to find genetic variations associated with a particular disease.”
- Why do one? “[R]esearchers can use the information to develop better strategies to detect, treat and prevent the disease.” This applies especially to “…common, complex diseases, such as asthma, cancer, diabetes….”
- Source: NHGRI website (http://www.genome.gov/20019523)

Hypothyroidism GWAS
- Insufficient hormone production by the thyroid gland can cause fatigue, weight gain, and other symptoms.
- Diagnosable and treatable
- About 3% of the American population has the clinical condition
- Different causes

Hypothyroidism GWAS: eMERGE Study
- Identify patients with presumptive Hashimoto’s disease-induced hypothyroidism (cases)
- Identify patients with normal thyroid function (controls)
- Genotype cases and controls (by testing for hundreds of thousands of SNPs)
- Perform a genome-wide association analysis

Phenotyping in a GWAS
- Doctors design a phenotyping algorithm based on the presence or absence of key procedures, medicines, and conditions in a patient’s medical history.
- The EHR is used as a resource:
  - Coded fields
  - Unmarked text
  - Images

Manual vs. Electronic Phenotyping
- Manual phenotyping by chart abstractors
  - Accurate (gold standard)
  - Far too expensive (~20,000 medical records to process)
- Electronic phenotyping by computers
  - Methods:
    - Query a database of coded fields
    - Natural language processing on free text
    - OCR and image processing on other resources
  - Comparatively cheap
  - A sample must be validated by chart abstractors

Natural Language Processing
- What is it?
- What problems must be solved?
- How can they be solved?

Natural Language Processing: Contextual Features
- Search for concepts in the free text of the EHR
- A simple keyword search is insufficient:
  - “There was no evidence of polyps or ulceration.”
  - “Rule out H. pylori, gastritis and gastropathy.”
  - “She should return to the Emergency Department if she experiences nausea or vomiting.”
  - “Patient should avoid any tests which involve the use of iodinated contrast material.”
  - “The indication for this procedure is family history of colon cancer.”
- The concepts in these sentences are negated, hypothetical, or part of the family history. A keyword match cannot tell any of these apart from a current finding for the patient.

NegEx
- Simple
- Performs well:
  - Against a gold standard
  - Against MedLEE
  - Against purely statistical methods
- Recently extended to hypothetical and family-history features (“ConText”)

NegEx: Triggers and Scope
- “There was no evidence of polyps or ulceration.” The scope of the trigger “no” runs forward over the concepts that follow it.
- “Rule out H. pylori, gastritis, and gastropathy.” The scope of “Rule out” runs forward over the conditions listed after it.
- “Quantitative PCR testing for BK Virus is negative.” “negative” follows the concept, so its scope runs backward to “BK Virus”.
- “No evidence of spread of cancer to the lungs.” and “No residua of healed fractures can be seen otherwise.” In both sentences the scope runs forward from “No”.

NegEx, and therefore ConText, requires carefully tuned lists of triggers and pseudotriggers. A pseudotrigger is a phrase that contains a trigger but does not signal negation, such as “no change”. How big must a list be to perform well?

Scenarios
- An annotated training set is used to populate the lists
- A large unmarked training set is used to extend existing lists

Using Annotated Data
- The NegEx/ConText creators provide annotated excerpts from medical records
- Look for associations between words and negation to populate the list of triggers
- Look for associations between words near triggers and false positives to populate the list of pseudotriggers

Identifying Triggers
- Create a confusion matrix for each word
- Sort the words by some statistic based on these confusion matrices
- Select or reject the top candidate as a trigger
- Repeat on the not-yet-explained sentences until a stopping condition is met

  Confusion matrix:
                     Actual +   Actual -
    Predicted +      TP         FP
    Predicted -      FN         TN

Identifying Triggers: Statistical Measures Used
- Log-likelihood ratio (LLR)
- Precision (PPV)
- Recall (sensitivity)
- F-measure

Log-Likelihood Ratio: Greedy Trigger Selection
In each round, the candidate word with the highest LLR is added to the trigger list. Candidate statistics are computed on the sentences that the current list has not yet explained. The totals describe the whole list after the candidate is added. With an empty list, all totals are 0.0. Precision, recall, and F-measure are percentages.

  Round  Candidate          Candidate                      Total after adding
                            LLR     Prec.  Recall  F       LLR     Prec.  Recall  F
  1      no                 1763.2   95.3   69.9   80.6    1763.2   95.3   69.9   80.6
  2      denies              617.8  100.0   50.0   66.7    2371.6   96.1   84.9   90.2
  3      not                 179.4   70.6   32.4   44.4    2519.5   94.2   89.8   92.0
  4      denied              187.6  100.0   34.0   50.7    2704.2   94.4   93.3   93.9
  5      without              79.9   60.0   27.3   37.5    2763.2   93.4   95.1   94.2
  6      negative             77.7  100.0   25.0   40.0    2839.7   93.5   96.3   94.9
  7      resolved (post)      61.3   83.3   27.8   41.7    2900.0   93.4   97.4   95.3
  8      4-way tie!              -      -      -      -    2900.0   93.4   97.4   95.3 (unchanged)

Final trigger list: { no, denies, not, denied, without, negative, resolved (post) }. “(post)” marks a trigger that follows the concept it applies to.

Other Measures
- Precision (PPV)
  - 271-way tie at 100%
  - Poor metric
- Recall (sensitivity)
  - Catches all the same words as LLR
  - Also finds “any”, “the”, and “for”
  - Imprecise metric
- F-measure
  - Identical results to LLR
  - Good metric

Identifying Pseudotriggers
- Use an analogous method to find words that predict false positives
- Limit the search to words next to triggers
- Filter out prospects with low precision
- Sort by LLR

Identifying Pseudotriggers: Results
- The candidates fall into three groups:
  - Some real pseudotriggers
  - Some that should be considered for addition to the list of pseudotriggers
  - Some entirely anomalous pseudotriggers
- Candidates found: “not know”, “no additional”, “no residua”, “without difficulty”, “no hepatosplenomegaly”

Further Work
- Formalize the stopping condition
- Try other statistical measures
- Can potential pseudotriggers be explored further using unannotated EHRs?
- Evaluate the finished algorithm on ConText data

Using Unmarked Data
- Many pseudotriggers are variations on other pseudotriggers:
  - “No change”
  - “No significant change”
  - “No increase”
- Could a large unmarked corpus of EHRs be searched for variations on pseudotriggers?

Phrase Comparison Methods
- Edit distance
- N-gram similarity, set similarity
- Vector-based methods

Word Comparison Methods
- Path-based: Path, Wu-Palmer, Leacock-Chodorow
- Path-based with information content (IC): Resnik, Jiang-Conrath, Lin
- Gloss-based: Lesk (and Extended Lesk), Gloss Vector, LSA

Preliminary Results
- Edit distance seems to be a poor phrase comparison metric
- Path-based measures seem to be poor word comparison metrics

Further Work
- Explore gloss-based measures of word similarity
- Explore measures of phrase similarity other than edit distance
- Evaluate the finished metric on ConText lists

Validation
- Take the algorithms developed on NegEx and apply them to ConText
- Have chart abstractors evaluate terms from some documents in the hypothyroidism GWAS
- Compare the performance of unmodified ConText with that of the extended version(s)

Results/Conclusion
- The study is ongoing; no final results are available.
- The methods described in this presentation show promise, but they must be validated before any conclusions can be drawn.
- If the phrase comparison metric performs well, it could be used to address smoothing problems in n-gram models.

N-gram Interlude
- N-gram models estimate the probability of the next word based on the leading context:
  - “Class, please hand your homework ___”
  - “I heard a sharp rap on the ___”
- Many applications:
  - Machine translation
  - OCR, speech recognition, spell checking
  - Identifying pathogenicity islands in viral and bacterial genomes; predicting protein folding

N-gram Interlude
- As the size of the n-grams (i.e., n) increases:
  - Performance improves
  - The number of parameters increases exponentially
  - The data set needed to estimate the parameters accurately becomes impossibly large
  - Missing parameters must be estimated from existing ones (smoothing)
- Could smoothing be based on a phrase similarity metric?

Bibliography
- Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34(5):301-10.
- Chu D, Dowling JN, Chapman WW. Evaluating the effectiveness of four contextual features in classifying annotated clinical conditions in emergency department reports. AMIA Annu Symp Proc 2006:141-5.
- Goryachev S, Kim H, Zeng-Treitler Q. Identification and extraction of family history information from clinical reports. AMIA Annu Symp Proc 2008 Nov 6:247-51.
- Goryachev S, Sordo M, Zeng QT, Ngo L. Implementation and evaluation of four different methods of negation detection. Technical report, DSG; 2006.
- Harkema H, et al. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009. doi:10.1016/j.jbi.2009.05.002
- Pedersen T. Fishing for exactness. In: Proceedings of the South-Central SAS Users Group Conference; 1996; Austin, TX. p. 188-200.
- Xu H, Anderson K, Grann VR, Friedman C. Facilitating cancer research using natural language processing of pathology reports. Medinfo 2004;2004:565-72.

Acknowledgements
- Luke Rasmussen
- Laura Coleman & Ruth Zetek
- Justin Starren
- MCRF donors
- Creators and maintainers of:
  - NegEx/ConText: W. Chapman, H. Harkema, X. Shen, Kang
  - NLTK: S. Bird, E. Klein, E. Loper, et al.
  - WordNet::Similarity: Ted Pedersen, et al.
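The "Identifying Triggers" slides score each candidate word from its 2x2 confusion matrix. The slides do not give the exact LLR formula. The sketch below assumes the common G² form, 2 Σ O·ln(O/E) over the four cells, where E is the count expected if the word and negation were independent. It also shows precision, recall, and F-measure computed from the same table. The class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Confusion:
    """2x2 confusion matrix for one candidate trigger word."""
    tp: int  # word present, concept annotated as negated
    fp: int  # word present, concept not negated
    fn: int  # word absent, concept negated
    tn: int  # word absent, concept not negated

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    def f_measure(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if p + r else 0.0

    def llr(self) -> float:
        """G^2 log-likelihood ratio (assumed form; the slides do not specify it)."""
        cells = [[self.tp, self.fp], [self.fn, self.tn]]
        n = self.tp + self.fp + self.fn + self.tn
        if n == 0:
            return 0.0
        rows = [sum(r) for r in cells]
        cols = [cells[0][j] + cells[1][j] for j in range(2)]
        g2 = 0.0
        for i in range(2):
            for j in range(2):
                observed = cells[i][j]
                if observed:  # a zero cell contributes 0 * log(0) = 0
                    expected = rows[i] * cols[j] / n
                    g2 += observed * math.log(observed / expected)
        return 2 * g2
```

For example, `Confusion(tp=10, fp=0, fn=10, tn=80)` has precision 1.0, recall 0.5, and an F-measure of about 0.667. A table in which the word and negation are independent, such as `Confusion(5, 5, 5, 5)`, has an LLR of 0.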
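The greedy loop on the "Identifying Triggers" slide (score every word, take the top candidate, repeat on unexplained sentences) can be sketched as below. This version makes several assumptions:

- Each training item is a hypothetical `(tokens, negated)` pair.
- A sentence counts as "explained" once it contains a selected trigger.
- `accept` is a hypothetical stand-in for manual select-or-reject review.
- `max_triggers` and `min_score` are hypothetical stopping conditions; the slides leave the real one to future work.

The sketch scores with F-measure, which the "Other Measures" slide reports gave results identical to LLR.

```python
from typing import Callable, List, Set, Tuple

Sentence = Tuple[List[str], bool]  # (tokens, concept annotated as negated?)


def f_measure(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def score(word: str, sentences: List[Sentence]) -> float:
    """F-measure of the rule 'word present => negated' on the given sentences."""
    tp = sum(1 for toks, neg in sentences if neg and word in toks)
    fp = sum(1 for toks, neg in sentences if not neg and word in toks)
    fn = sum(1 for toks, neg in sentences if neg and word not in toks)
    return f_measure(tp, fp, fn)


def select_triggers(
    sentences: List[Sentence],
    accept: Callable[[str], bool] = lambda w: True,  # hypothetical manual review
    max_triggers: int = 10,  # hypothetical stopping condition
    min_score: float = 0.3,  # hypothetical stopping condition
) -> List[str]:
    triggers: List[str] = []
    rejected: Set[str] = set()
    remaining = list(sentences)  # sentences not yet explained by any trigger
    while len(triggers) < max_triggers:
        candidates = {w for toks, _ in remaining for w in toks} - rejected
        best, best_score = None, 0.0
        for word in sorted(candidates):  # sorted: deterministic tie-breaking
            s = score(word, remaining)
            if s > best_score:
                best, best_score = word, s
        if best is None or best_score < min_score:
            break  # no candidate is good enough: stop
        if not accept(best):
            rejected.add(best)  # rejected words are never proposed again
            continue
        triggers.append(best)
        # Sentences containing the new trigger are now explained.
        remaining = [(toks, neg) for toks, neg in remaining if best not in toks]
    return triggers


sentences = [
    (["no", "evidence", "of", "polyps"], True),
    (["patient", "denies", "pain"], True),
    (["no", "fever"], True),
    (["mild", "pain", "noted"], False),
    (["polyps", "seen"], False),
    (["denies", "nausea"], True),
]
print(select_triggers(sentences))
```

On this toy data the first round ties “denies” and “no” (the sorted order breaks the tie). After the “denies” sentences are removed, “no” explains every remaining negated sentence, so the loop returns `['denies', 'no']`.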
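The "NegEx: Triggers and Scope" examples show three behaviors. Forward scope follows a trigger such as “no”. Backward scope precedes a post-trigger such as “negative”. Pseudotriggers such as “no change” suppress the trigger words inside them. The sketch below illustrates these three ideas only; it is not the NegEx or ConText implementation. Its assumptions:

- The trigger lists are hypothetical, tiny subsets.
- Scope simply runs to the sentence boundary. The original NegEx limited scope to a small window of words, and ConText also ends scope at termination terms.
- Only the first mention of each concept is checked.

```python
import re
from typing import Iterator, List, Set

# Hypothetical, illustrative lists (not the real NegEx/ConText lists).
PRE_TRIGGERS = ["no", "denies", "without", "rule out"]  # scope runs forward
POST_TRIGGERS = ["negative", "resolved"]                # scope runs backward
PSEUDO_TRIGGERS = ["no change", "no significant change", "no increase"]


def _matches(phrase: str, text: str) -> Iterator[re.Match]:
    """All whole-word occurrences of phrase in text."""
    return re.finditer(r"\b" + re.escape(phrase) + r"\b", text)


def negated_concepts(sentence: str, concepts: List[str]) -> Set[str]:
    text = sentence.lower()
    # Blank out pseudotriggers with spaces of the same length so offsets are
    # preserved and the trigger words inside them are no longer matched.
    for pseudo in PSEUDO_TRIGGERS:
        text = re.sub(r"\b" + re.escape(pseudo) + r"\b",
                      lambda m: " " * len(m.group()), text)
    negated: Set[str] = set()
    for concept in concepts:
        start = text.find(concept.lower())  # first mention only
        if start < 0:
            continue  # concept not mentioned in this sentence
        end = start + len(concept)
        # A pre-trigger anywhere before the concept negates it (simplified scope).
        if any(m.end() <= start for t in PRE_TRIGGERS for m in _matches(t, text)):
            negated.add(concept)
        # A post-trigger anywhere after the concept negates it.
        elif any(m.start() >= end for t in POST_TRIGGERS for m in _matches(t, text)):
            negated.add(concept)
    return negated


print(negated_concepts("Quantitative PCR testing for BK Virus is negative.", ["BK Virus"]))
```

For “No significant change in the nodule.”, the pseudotrigger is blanked before any trigger is matched, so “nodule” is not reported as negated.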
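Edit distance, the first entry under "Phrase Comparison Methods", can be computed over words rather than characters to compare pseudotrigger variants. The sketch below is standard word-level Levenshtein distance; the function name is hypothetical.

```python
from typing import List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Minimum number of word insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


for other in ["no significant change", "no increase", "no hepatosplenomegaly"]:
    print(other, edit_distance("no change".split(), other.split()))
```

Each of the three phrases is exactly one edit away from “no change”. That includes “no hepatosplenomegaly”, which is a genuine negated finding. This offers one plausible reading of the "Preliminary Results" slide's observation that edit distance seems to be a poor phrase comparison metric: it cannot tell a real pseudotrigger variant from a true negation.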
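The second "N-gram Interlude" slide says that unseen n-grams must have their probabilities estimated (smoothing). A bigram model over a vocabulary of V words already has V² parameters, so most of them are never observed. The sketch below uses add-one (Laplace) smoothing, a standard and deliberately simple technique chosen here for illustration. It is not the similarity-based smoothing the slides propose. The function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def train_bigram(tokens: List[str]) -> Callable[[str, str], float]:
    """Bigram model P(word | prev) with add-one (Laplace) smoothing."""
    history = Counter(tokens[:-1])              # times each word starts a bigram
    bigrams = Counter(zip(tokens, tokens[1:]))  # observed bigram counts
    vocab_size = len(set(tokens))

    def prob(prev: str, word: str) -> float:
        # Adding 1 to every count gives unseen bigrams a small nonzero probability.
        return (bigrams[(prev, word)] + 1) / (history[prev] + vocab_size)

    return prob


tokens = "i heard a sharp rap on the door".split()
prob = train_bigram(tokens)
print(prob("the", "door"), prob("the", "window"))
```

Here “the door” was observed once and gets probability 2/9. The unseen “the window” still gets 1/9. A similarity-based scheme would instead borrow probability mass from observed phrases judged similar to the unseen one.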