Incorporating N-gram Statistics in the Normalization of Clinical Notes
By Bridget Thomson McInnes

Overview
• Ngrams
• Ngram Statistics for Spelling Correction
• Spelling Correction
• Ngram Statistics for Multi-Term Identification
• Multi-Term Identification

Ngrams
Example sentence: "Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient."
Bigrams: her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient
Trigrams: her dobutamine stress, dobutamine stress echo, stress echo showed, echo showed mild, showed mild aortic, mild aortic stenosis, aortic stenosis with, stenosis with a, with a subaortic, a subaortic gradient

Contingency Tables

              word 2     !word 2
  word 1       n11        n12       n1p
  !word 1      n21        n22       n2p
               np1        np2       npp

• n11 = the joint frequency of word 1 and word 2
• n12 = the frequency with which word 1 occurs and word 2 does not
• n21 = the frequency with which word 2 occurs and word 1 does not
• n22 = the frequency with which neither word 1 nor word 2 occurs
• npp = the total number of ngrams
• n1p, np1, n2p, np2 = the marginal counts

Contingency Tables: Example
Each of the 11 bigrams of the example sentence (her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient) occurs exactly once, so npp = 11. For the bigram "stress echo":

              echo     !echo
  stress        1         0        1
  !stress       0        10       10
                1        10       11

Contingency Tables: Expected Values

              word 2     !word 2
  word 1       n11        n12       n1p
  !word 1      n21        n22       n2p
               np1        np2       npp

• m11 = (np1 * n1p) / npp
• m12 = (np2 * n1p) / npp
• m21 = (np1 * n2p) / npp
• m22 = (np2 * n2p) / npp

Contingency Tables: Expected Values Example
For the "stress echo" table above:
• m11 = ( 1 *  1) / 11 = 0.09
• m12 = (10 *  1) / 11 = 0.91
• m21 = ( 1 * 10) / 11 = 0.91
• m22 = (10 * 10) / 11 = 9.09
What is this telling you? "stress echo" occurs once in our example, but its expected frequency if "stress" and "echo" were independent is only 0.09 (m11).
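To make the table construction concrete, here is a minimal Python sketch, assuming bigrams are simply adjacent token pairs as in the example sentence; the function name and layout are illustrative and this is not the statistics package described later in the talk.

```python
def bigram_contingency(tokens, w1, w2):
    """Build the 2x2 contingency table and expected values for the bigram
    (w1, w2) over all bigrams of a token list."""
    bigrams = list(zip(tokens, tokens[1:]))
    npp = len(bigrams)                              # total number of bigrams
    n11 = sum(a == w1 and b == w2 for a, b in bigrams)
    n1p = sum(a == w1 for a, _ in bigrams)          # w1 in the first position
    np1 = sum(b == w2 for _, b in bigrams)          # w2 in the second position
    n12, n21 = n1p - n11, np1 - n11
    n22 = npp - n11 - n12 - n21
    n2p, np2 = npp - n1p, npp - np1
    observed = (n11, n12, n21, n22)
    expected = (n1p * np1 / npp, n1p * np2 / npp,   # m11, m12
                n2p * np1 / npp, n2p * np2 / npp)   # m21, m22
    return observed, expected

sentence = "her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient"
observed, expected = bigram_contingency(sentence.split(), "stress", "echo")
print(observed)   # (1, 0, 0, 10)
print(expected)   # (0.0909..., 0.909..., 0.909..., 9.0909...)
```

Running it reproduces the table above: n11 = 1, n22 = 10, and an expected joint count m11 of about 0.09.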
Ngram Statistics: Measures of Association
• Log Likelihood Ratio
• Chi-Squared Test
• Odds Ratio
• Phi Coefficient
• T-Score
• Dice Coefficient
• True Mutual Information

Log Likelihood Ratio
Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) ), summed over the cells of the table
The log likelihood ratio measures the difference between the observed and expected values. It sums, over the cells of the table, the observed count weighted by the log of the ratio of the observed to the expected count.

Chi-Squared Test
X2 = ∑ ( (nij - mij)^2 / mij ), summed over the cells of the table
The chi-squared test also measures the difference between the observed and expected values. It sums, over the cells of the table, the squared difference between the observed and expected counts, scaled by the expected count.

Odds Ratio
Odds Ratio = (n11 * n22) / (n21 * n12)
The odds ratio is the ratio of the number of times an event takes place to the number of times it does not. It is the cross-product ratio of the 2x2 contingency table and measures the magnitude of the association between the two words.

Phi Coefficient
Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( n1p * np1 * n2p * np2 )
The bigram is considered positively associated if most of the data lies along the diagonal (that is, if n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.

T-Score
T-Score = ( n11 - m11 ) / sqrt( n11 )
The t-score tests whether there is a non-random association between the two words. It is the difference between the observed and expected joint frequencies divided by the square root of the observed joint frequency.

Dice Coefficient
Dice = 2 * n11 / (np1 + n1p)
The Dice coefficient depends on the frequency of the two words occurring together and on their individual frequencies.

True Mutual Information
TMI = ∑ ( (nij / npp) * log( nij / mij ) ), summed over the cells of the table
True mutual information measures the extent to which the observed frequencies differ from the expected frequencies.

Spelling Correction
Use context-sensitive information, in the form of bigrams, to determine the ranking of a given set of possible spelling corrections for a misspelled word. Given:
• the first content word prior to the misspelled word
• the first content word after the misspelled word
• a list of possible spelling corrections

Spelling Correction Example
Example sentence: "Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient."
List of possible corrections: artic, aortic
Basic idea: treat the misspelling as an unknown position (POS) in "her dobutamine stress echo showed mild POS stenosis with subaortic gradient" and run the statistical analysis on the bigrams each candidate forms with the content words before and after that position.

Spelling Correction Statistics

  Possible 1 (artic):              Possible 2 (aortic):
    mild artic          0.40         mild aortic          0.66
    artic stenosis      0.03         aortic stenosis      0.30
    weighted average    0.215        weighted average     0.46

• This allows us to take into consideration bigrams formed with the word prior to the misspelling and the word after it.
• Each possible word is then returned with its score (a small sketch of this ranking follows below).
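As a rough sketch of how two of the measures above and the candidate ranking could be computed, the Python below assumes an unweighted mean of the two bigram scores (the slide's weighted average of 0.46 for "aortic" suggests the actual weights differ slightly); the function names and the toy score table are illustrative, not the GSpell or context-sensitive implementation itself.

```python
import math

def phi(n11, n12, n21, n22):
    """Phi coefficient of a 2x2 contingency table."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    return (n11 * n22 - n21 * n12) / math.sqrt(n1p * np1 * n2p * np2)

def dice(n11, n12, n21, n22):
    """Dice coefficient of a 2x2 contingency table."""
    return 2 * n11 / ((n11 + n12) + (n11 + n21))

def rank_candidates(prev_word, next_word, candidates, score):
    """Rank spelling candidates by averaging the association score of the
    bigram (prev_word, candidate) with that of (candidate, next_word)."""
    scored = [(c, (score(prev_word, c) + score(c, next_word)) / 2) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy bigram scores standing in for corpus-derived statistics (values from the slide above).
toy_scores = {("mild", "artic"): 0.40, ("artic", "stenosis"): 0.03,
              ("mild", "aortic"): 0.66, ("aortic", "stenosis"): 0.30}
lookup = lambda w1, w2: toy_scores.get((w1, w2), 0.0)

print(rank_candidates("mild", "stenosis", ["artic", "aortic"], lookup))
# [('aortic', 0.48), ('artic', 0.215)]
```

With an unweighted mean, "aortic" still ranks well above "artic", matching the ordering on the slide.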
Types of Results
• Gspell only
• Context-sensitive only
• Hybrid of both Gspell and context: take the average of the Gspell and context-sensitive scores. Note: this turns into a backoff method when no statistical data is found for any of the possibilities.
• Backoff method: use only the context-sensitive score unless it does not exist, in which case revert to the Gspell score.

Preliminary Test Set
• Test set: partially scrubbed clinical notes
• Size: 854 words
• Number of misspellings: 82
• Includes abbreviations

Preliminary Results
GSPELL results:

                           Precision   Recall    F-measure
  GSPELL                     0.5357    0.7317     0.6186

Context-sensitive results:

  Measure of Association   Precision   Recall    F-measure
  PHI                        0.6161    0.8415     0.7113
  LL                         0.6071    0.8293     0.7010
  TMI                        0.6071    0.8293     0.7010
  ODDS                       0.6071    0.8293     0.7010
  X2                         0.6161    0.8415     0.7113
  TSCORE                     0.5625    0.7683     0.6495
  DICE                       0.6339    0.8659     0.7320

Preliminary Results
Hybrid method results:

  Measure of Association   Precision   Recall    F-measure
  PHI                        0.6607    0.9024     0.7629
  LL                         0.6339    0.8659     0.7320
  TMI                        0.6607    0.9024     0.7629
  ODDS                       0.6250    0.8537     0.7216
  X2                         0.6339    0.8659     0.7320
  TSCORE                     0.6071    0.8293     0.7010
  DICE                       0.6696    0.9146     0.7732

Notes on Log Likelihood
• Log likelihood is used quite often for context-sensitive spelling correction.
• Problem with large sample sizes: the marginal values are very large, which inflates the expected values, so the observed values are commonly much lower than the expected values.
• Very independent and very dependent ngrams end up with the same value.
• Similar characteristics were noticed with true mutual information.

Example of Problem
Contingency table for the bigram "follow hip", with the joint count n11 left variable (the other cells shown correspond to n11 = 11):

              hip        !hip
  follow      n11        88951        88962
  !follow     65729      69783140     69848869
              65740      69872091     69937831

  n11      Log Likelihood
   11         145.3647
  190         143.4268
   86           0.09864

With these marginals the expected joint count m11 is roughly 84, so a joint count far below it (11) and one far above it (190) receive nearly the same log likelihood score, while a count close to it (86) scores near zero.

Conclusions with Preliminary Results
• The Dice coefficient returns the best results.
• The Phi coefficient returns the second best.
• Log likelihood and true mutual information should not be used.
• The program now needs to be tested with a more extensive test bed, which is in the process of being created.

Ngram Statistics for Multi-Term Identification
• Cannot use the previous statistics package:
  - memory constraints due to the amount of data
  - would like to look for longer ngrams
• Alternative: suffix arrays (Church and Yamamoto)
  - reduces the amount of memory
  - two arrays: one contains the corpus, the other contains identifiers to the ngrams in the corpus
  - two stacks: one contains the longest common prefix, the other contains the document frequency
  - allows ngrams up to the size of the corpus to be found

Suffix Arrays
Corpus: "to be or not to be". Each array element is considered a suffix, running from some starting position in the corpus to the end of the array:
  to be or not to be
  be or not to be
  or not to be
  not to be
  to be
  be

Suffix Arrays
Number the suffixes by their starting positions and sort them alphabetically:
  [0] = 5 => be
  [1] = 1 => be or not to be
  [2] = 3 => not to be
  [3] = 2 => or not to be
  [4] = 4 => to be
  [5] = 0 => to be or not to be
Actual suffix array: 5 1 3 2 4 0

Term Frequency
• Term frequency (tf) is the number of times an ngram occurs in the corpus.
• To determine the tf of an ngram from the sorted suffix array: tf = j - i + 1, where i is the first and j the last position in the sorted array whose suffix begins with the ngram.
  [0] = 5 => be
  [1] = 1 => be or not to be
  [2] = 3 => not to be
  [3] = 2 => or not to be
  [4] = 4 => to be
  [5] = 0 => to be or not to be
For example, the suffixes beginning with "to be" occupy positions 4 through 5, so tf("to be") = 5 - 4 + 1 = 2.

Measures of Association
• Residual Inverse Document Frequency (RIDF)
  RIDF = -log( df / D ) + log( 1 - exp( -tf / D ) ), where df is the document frequency and D the number of documents
  Compares the distribution of a term over documents to what would be expected of a random term.
• Mutual Information (MI)
  MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) )
  Compares the frequency of the whole ngram to the frequencies of its parts.

Present Work
• Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
• Retrieved the respective text for each heading
• Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 for the data under each section
• Noticed that different multi-word terms appear for each of the different sections

Conclusions
Ngram statistics can be applied directly and indirectly to various problems.
• Directly: spelling correction, compound word identification, term extraction, name identification
• Indirectly: part-of-speech tagging, information retrieval, data mining

Packages
Two statistical packages:
• Contingency table approach
  - Measures for bigrams: Log Likelihood, True Mutual Information, Chi-Squared Test, Odds Ratio, Phi Coefficient, T-Score, and Dice Coefficient
  - Measures for trigrams: Log Likelihood and True Mutual Information
• Suffix array approach
  - Measures for all lengths of ngrams: Residual Inverse Document Frequency and Mutual Information
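To close, here is a minimal word-level Python sketch of the suffix-array term-frequency computation described in the Suffix Arrays and Term Frequency slides above; it keeps everything in memory, finds matches with a linear scan rather than the binary search a real implementation would use, and omits the longest-common-prefix and document-frequency stacks of the Church and Yamamoto construction.

```python
def build_suffix_array(tokens):
    """Word-level suffix array: starting positions sorted by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def term_frequency(tokens, suffix_array, ngram):
    """tf of an ngram = j - i + 1, where i and j are the first and last positions
    in the sorted suffix array whose suffixes start with the ngram."""
    n = len(ngram)
    hits = [pos for pos, start in enumerate(suffix_array)
            if tokens[start:start + n] == ngram]
    if not hits:
        return 0
    i, j = hits[0], hits[-1]        # first and last occurrence in the sorted array
    return j - i + 1

tokens = "to be or not to be".split()
sa = build_suffix_array(tokens)
print(sa)                                        # [5, 1, 3, 2, 4, 0]
print(term_frequency(tokens, sa, ["to", "be"]))  # 2
print(term_frequency(tokens, sa, ["be"]))        # 2
```

On the "to be or not to be" example this reproduces the suffix array 5 1 3 2 4 0 from the slides and a term frequency of 2 for both "to be" and "be".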