Preserving Semantic Content in Text Mining Using Multigrams

Yasmin H. Said
Department of Computational and Data Sciences
George Mason University
QMDNS 2010 - May 26, 2010

This is joint work with Edward J. Wegman

Outline
• Background on Text Mining
• Bigrams
  – Term-Document and Bigram-Document Matrices
  – Term-Term and Document-Document Associations
• Example using 15,863 Documents

To read between the lines is easier than to follow the text.
  -Henry James

Text Data Mining
• Synthesis of ...
  – Information Retrieval
    • Focuses on retrieving documents from a fixed database
    • May be multimedia including text, images, video, audio
  – Natural Language Processing
    • Usually more challenging questions
    • Bag-of-words methods
    • Vector space models
  – Statistical Data Mining
    • Pattern recognition, classification, clustering

Natural Language Processing
• Key elements are:
  – Morphology (grammar of word forms)
  – Syntax (grammar of word combinations to form sentences)
  – Semantics (meaning of a word or sentence)
  – Lexicon (vocabulary or set of words)
• Time flies like an arrow
  – Time passes speedily, like an arrow passes speedily, or
  – Measure the speed of a fly like you would measure the speed of an arrow
• Ambiguity of nouns and verbs
• Ambiguity of meaning

Text Mining Tasks
• Text Classification
  – Assigning a document to one of several pre-specified classes
• Text Clustering
  – Unsupervised learning
• Text Summarization
  – Extracting a summary for a document
  – Based on syntax and semantics
• Author Identification/Determination
  – Based on stylistics, syntax, and semantics
• Automatic Translation
  – Based on morphology, syntax, semantics, and lexicon
• Cross-Corpus Discovery
  – Also known as Literature-Based Discovery

Preprocessing
• Denoising
  – Means removing stopper words ... words with little semantic meaning such as the, an, and, of, by, that, and so on.
  – Stopper words may be context dependent, e.g.
Theorem and Proof in a mathematics document
• Stemming
  – Means removing suffixes, prefixes, and infixes to reduce a word to its root
  – An example: wake, waking, awake, woke → wake

Bigrams and Trigrams
• A bigram is a word pair where the order of the words is preserved.
  – The first word is the reference word.
  – The second is the neighbor word.
• A trigram is a word triple where order is preserved.
• Bigrams and trigrams are useful because they can capture semantic content.

Example
• Hell hath no fury like a woman scorned.
• Denoised: Hell hath no fury like woman scorned.
• Stemmed: Hell has no fury like woman scorn.
• Bigrams:
  – Hell has, has no, no fury, fury like, like woman, woman scorn, scorn .
  – Note that the "." (any sentence-ending punctuation) is treated as a word

Bigram Proximity Matrix

          .     fury  has   hell  like  no    scorn woman
  .
  fury                            1
  has                                         1
  hell                1
  like                                                    1
  no            1
  scorn   1
  woman                                             1

  (Rows are reference words, columns are neighbor words; e.g., the
  bigram "hell has" puts a 1 in row hell, column has.)

Bigram Proximity Matrix
• The bigram proximity matrix (BPM) is computed for an entire document
  – Entries in the matrix may be either binary or a frequency count
• The BPM is a mathematical representation of a document with some claim to capturing semantics
  – Because bigrams capture noun-verb, adjective-noun, verb-adverb, and verb-subject structures
  – Martinez (2002)

Vector Space Methods
• The classic structure in vector space text mining methods is a term-document matrix where
  – Rows correspond to terms, columns correspond to documents, and
  – Entries may be binary or frequency counts
• A simple and obvious generalization is a bigram-document matrix where
  – Rows correspond to bigrams, columns to documents, and again entries are either binary or frequency counts

Example Data
• The text data were collected by the Linguistic Data Consortium in 1997 and were originally used in Martinez (2002)
  – The data consist of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995
  – The full lexicon for the text database included 68,354 distinct words
• In all, 313 stopper words are removed; after denoising
and stemming, there remain 45,021 words in the lexicon
• The example that I report here is based on the full set of 15,863 documents. This is the same basic data set that Dr. Wegman reported on in his keynote talk, although he considered a subset of 503 documents.

Vector Space Methods
• A document corpus we have worked with has 45,021 denoised and stemmed entries in its lexicon and 1,834,123 bigrams
  – Thus the TDM is 45,021 by 15,863 and the BDM is 1,834,123 by 15,863
  – The term vector is 45,021-dimensional and the bigram vector is 1,834,123-dimensional
  – The BPM for each document is 1,834,123 by 1,834,123 and, of course, very sparse.

Term-Document Matrix Analysis
[Figure: term frequency distribution following Zipf's Law]

Term-Document Matrix Analysis
[Figure: two-dimensional scatterplot of the documents, with documents numbered and points labeled by stemmed terms (e.g., serb, bosnia, comet, jupit, korea, earthquak, simpson, oklahoma); the plot labels are not reproducible as text]

Text Example - Clusters
• A portion of the hierarchical agglomerative tree for the clusters

Text Example - Clusters
Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008
Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%
Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5%
Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102
Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50
Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26

Text Example - Clusters
Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008
Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5%
Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8%
Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165
Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69
Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35,
bobbi.wayn.hall 29

Text Example - Clusters
Cluster 24, Size: 1788, ISim: 0.012, ESim: 0.007
Descriptive: school 2.2%, film 1.3%, children 1.2%, student 1.0%, percent 0.8%, compani 0.7%, kid 0.7%, peopl 0.7%, movi 0.7%, music 0.6%
Discriminating: school 2.3%, simpson 1.8%, film 1.7%, student 1.1%, presid 1.0%, serb 0.9%, children 0.8%, clinton 0.8%, movi 0.8%, music 0.8%
Phrases 1: cnn 1034, peopl 920, time 893, report 807, don 680, dai 650, look 630, call 588, live 535, lot 498
Phrases 2: littl.bit 99, lot.peopl 90, lo.angel 85, world.war 71, thank.join 67, million.dollar 60, 000.peopl 54, york.citi 50, garsten.cnn 48, san.francisco 47
Phrases 3: jeann.moo.cnn 41, cnn.entertain.new 36, cnn.jeann.moo 32, norma.quarl.cnn 30, cnn.norma.quarl 28, cnn.jeff.flock 28, jeff.flock.cnn 27, brian.cabel.cnn 26, pope.john.paul 25, lisa.price.cnn 25

Bigrams

Bigrams - Cluster 1

Cluster Size Distribution

Document by Cluster Plot

Cluster Identities
• Cluster 02: Comet Shoemaker-Levy Crashing into Jupiter.
• Cluster 08: Oklahoma City Bombing.
• Cluster 11: Bosnian-Serb Conflict.
• Cluster 12: Court-Law, O.J. Simpson Case.
• Cluster 15: Cessna Plane Crashed onto the South Lawn of the White House.
• Cluster 19: American Army Helicopter Emergency Landing in North Korea.
• Cluster 24: Death of North Korean Leader (Kim Il Sung) and North Korea's Nuclear Ambitions.
• Cluster 26: Shootings at Abortion Clinics in Boston.
• Cluster 28: Two Americans Detained in Iraq.
• Cluster 30: Earthquake that Hit Japan.

Bigram-Document Matrix for 50 Documents

Bigram-Bigram Matrix for 50 Documents

Bigram-Bigram Matrix Using the Top 253 Bigrams

Closing Remarks
• Text mining presents great challenges, but is amenable to statistical/mathematical approaches
  – Text mining using vector space methods raises both mathematical and visualization challenges
    • especially in terms of dimensionality, sparsity, and scalability.

Acknowledgments
• Dr. Angel Martinez
• Dr. Jeff Solka and Avory Bryant
• Dr.
Walid Sharabati

Funding Sources
– National Institute on Alcohol Abuse and Alcoholism (Grant Number F32AA015876)
– Army Research Office (Contract W911NF-04-1-0447)
– Army Research Laboratory (Contract W911NF-07-1-0059)
– Isaac Newton Institute

Contact Information
Yasmin H. Said
Department of Computational and Data Sciences
Email: ysaid99@hotmail.com
Phone: 301-538-7478

The length of this document defends it well against the risk of its being read.
  -Winston Churchill
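Appendix: the pipeline walked through in these slides — denoising, stemming, bigram extraction, and the binary bigram proximity matrix — can be sketched in a few lines of code. This is a minimal illustration using the slides' "Hell hath no fury" example; the tiny stopper list and the hand-coded stem map below are placeholders for illustration only, not the 313-word stopper list or the stemmer actually used in the study.

```python
# Minimal sketch of the bigram pipeline: denoise (drop stopper words),
# stem, extract bigrams, and build a binary bigram proximity matrix (BPM).
import numpy as np

STOPPERS = {"a", "an", "the", "of", "by", "that"}       # toy stopper list
STEM = {"hath": "has", "scorned": "scorn",              # toy stem map
        "waking": "wake", "awake": "wake", "woke": "wake"}

def denoise(tokens):
    return [t for t in tokens if t not in STOPPERS]

def stem(tokens):
    return [STEM.get(t, t) for t in tokens]

def bigrams(tokens):
    # Order-preserving word pairs: (reference word, neighbor word).
    return list(zip(tokens, tokens[1:]))

def bpm(tokens):
    # Binary BPM: rows are reference words, columns are neighbor words.
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)), dtype=int)
    for ref, nbr in bigrams(tokens):
        m[index[ref], index[nbr]] = 1
    return vocab, m

# Sentence-ending punctuation is treated as a word, per the slides.
sentence = "hell hath no fury like a woman scorned .".split()
tokens = stem(denoise(sentence))
print(tokens)  # ['hell', 'has', 'no', 'fury', 'like', 'woman', 'scorn', '.']

vocab, m = bpm(tokens)
print(vocab)           # ['.', 'fury', 'has', 'hell', 'like', 'no', 'scorn', 'woman']
print(int(m.sum()))    # 7 bigrams -> 7 ones in the matrix
```

For a corpus, stacking one column of bigram counts per document in the same way yields the bigram-document matrix; at the scale reported here (1,834,123 bigrams by 15,863 documents) the matrix would of course be stored in a sparse format rather than as a dense array.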