Mining for Emerging Technologies Within Text Streams and Documents
Dave Engel (dave.engel@pnl.gov), Paul Whitney, Gus Calapristi, Fred Brockman
Text Mining Workshop (SDM09)

Outline
- Research objective
- Surprising event detection (poster session)
- Emerging technologies detection
- Analysis
- Future research

Objective
"Our research and development are targeted at developing algorithms to find and characterize changes in topic (technologies) within text (streams and documents)"

Surprise Event Detection
[Figure: processing overview with panels Collection of Text/Data, Topical Feature Extraction, Document Frequencies, Surprise Algorithms, Text Mining Analysis, Identified Surprising Events, Related Terms/Topics]
[Figure: temporal profiles, max Emergence, bin.width = 1 month, # bins = (12, 18), Virology dataset, 02/00 to 05/08. Labeled terms and values: cov 12, sars 16, california 12, st 11, coronavirus 18, ny 8, ca 16, pa 8, jcv 12, sa 6, syndrome 25, wu 6, tt 6, antiretroviral 32, respiratory 37, je 9, acute 45, ttv 6, rd 5, sclerosis 36]
[Figure: temporal profiles, max Surprise, bin.width = 1 month, # bins = 12, Virology dataset, 02/00 to 05/08. Labeled terms: sclerosis, ms, varicella, sars, leukoencephalopathy, coronavirus, phst, respiratory, res, acute, nodules, sapovirus, silencing, vzv, coronaviruses, bv, sirna, prp, jc, pml, zoster, syndrome, imaging]

Surprise Algorithms
x_t = # documents containing term
N_t = # documents within time interval

Gaussian model:
\[ G = \frac{x_t - \bar{x}_t}{s \sqrt{1 + 1/n}} \]

Chi-Square (Pearson) model:
\[ \begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} = \begin{pmatrix} x_t & N_t - x_t \\ m \bar{x}_t & N_t - m \bar{x}_t \end{pmatrix} \]
\[ \chi^2_Y = \frac{n_{..} \left( |n_{11} n_{22} - n_{12} n_{21}| - n_{..}/2 \right)^2}{n_{1.} n_{2.} n_{.1} n_{.2}} \]
where
n_{1.} = n_{11} + n_{12}, n_{2.} = n_{21} + n_{22},
n_{.1} = n_{11} + n_{21}, n_{.2} = n_{12} + n_{22},
n_{..} = n_{11} + n_{12} + n_{21} + n_{22}
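The two surprise models named above (Gaussian and Chi-Square with a continuity correction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Gaussian denominator is read as s * sqrt(1 + 1/n) with s the sample standard deviation of the n previous bins, and the clamp to zero in the chi-square numerator is an added guard that the slide does not show.

```python
import math
from statistics import mean, stdev


def gaussian_surprise(previous, current):
    """Gaussian surprise: (x_t - mean) / (s * sqrt(1 + 1/n)).

    `previous` holds the term's document counts in the n previous bins,
    `current` is the count x_t in the bin being scored (assumed reading).
    """
    n = len(previous)
    if n < 2:
        raise ValueError("need at least two previous bins")
    center = mean(previous)
    s = stdev(previous)  # sample standard deviation (assumption)
    if s == 0:
        if current == center:
            return 0.0
        return math.copysign(math.inf, current - center)
    return (current - center) / (s * math.sqrt(1 + 1 / n))


def yates_chi_square(n11, n12, n21, n22):
    """Chi-square statistic for a 2x2 table with Yates' correction."""
    r1, r2 = n11 + n12, n21 + n22      # row totals n1., n2.
    c1, c2 = n11 + n21, n12 + n22      # column totals n.1, n.2
    total = r1 + r2                    # grand total n..
    denom = r1 * r2 * c1 * c2
    if denom == 0:
        return 0.0
    # Added guard (not on the slide): do not let the correction go negative.
    corrected = max(abs(n11 * n22 - n12 * n21) - total / 2, 0.0)
    return total * corrected ** 2 / denom
```

For example, `gaussian_surprise([1, 3], 5)` scores a count of 5 against a two-bin history, and `yates_chi_square(10, 20, 30, 40)` scores a single 2x2 table; larger values indicate a bin that departs more strongly from its history.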
Surprise Modeling Scheme
Surprising/Emerging Events Modeling Scheme
[Figure: three event types (Point Discontinuity, Jump Discontinuity, Slope Discontinuity), each scored at bin i against a window of earlier bin values x_j; example term labels: bipartisan, discuss, summit, recommendations, european, provide]

Surprise Algorithms
x_t = # documents containing term
N_t = # documents within time interval

Gaussian model:
\[ G = \frac{x_t - \bar{x}_t}{s \sqrt{1 + 1/n}} \]

Chi-Square (Pearson) model:
\[ \begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} = \begin{pmatrix} x_t & N_t - x_t \\ m \bar{x}_t & N_t - m \bar{x}_t \end{pmatrix} \]
\[ \chi^2_Y = \frac{n_{..} \left( |n_{11} n_{22} - n_{12} n_{21}| - n_{..}/2 \right)^2}{n_{1.} n_{2.} n_{.1} n_{.2}} \]
where
n_{1.} = n_{11} + n_{12}, n_{2.} = n_{21} + n_{22},
n_{.1} = n_{11} + n_{21}, n_{.2} = n_{12} + n_{22},
n_{..} = n_{11} + n_{12} + n_{21} + n_{22}

Surprise GUI
Analysis Tools
[Figure: Surprise GUI screenshot, VofA dataset, 834 documents. Panels: # Docs vs. Time (11/23 to 12/18, time interval = 12 hours); Surprise Stat vs. Record (Gaussian, surprise.window = 6, bin.size = 12 hours); temporal profiles, max Surprise, bin.width = 12 hours, # bins = 6, VofA dataset, with labeled terms rights, co, forces, top, services, robert, baker, combat, held, council, iraqi, bipartisan, discuss, provide, summit, recommendations, european, hamilton, lee, following. A further term panel lists cov, sars, coronavirus, syndrome, respiratory, acute, sapovirus, silencing, coronaviruses, sirna]
Term list shown in the GUI:
forces, top, services, robert, baker, combat, held, council, iraqi, bipartisan, discuss, provide, summit, recommendations, european, hamilton, lee, following, expressed, opposition, press, troops, world, administration, syria, east, file, human, control, police, efforts, comes, intelligence, issues, nbsp, hopes, situation, united, law, policy, changes, agree, remains, presidential, organization, bush, border, west, speech, led, north, late, regional, african, progress, direct, republican, protect, defense, current, officials, court, james, cri, nato, mission, british, talk, violent, amid, ally, remain, foreign, house, studio,
terrorism, military, israel, capital, leaders, real, parliament, committee, army

Emerging Technologies Detection Process Flow
1. Source Data Selection: gather multiple data sources/types.
2. Topical Feature Extraction: process with IN-SPIRE to evaluate suitability of the data content, find topics for emergence measurement, and prepare for Surprise! analysis.
3. Emergence Algorithm Development: event detection algorithms modified and enhanced per biologists' feedback (iterate, enhance, refine).
4. Domain Expert Review & Evaluation: biologist and analyst review to validate findings from emergence algorithms.
Outcome: emerging technologies identified.

Emergence Modeling Scheme
Emerging Events Modeling Scheme
[Figure: two event types (Jump Discontinuity, Slope Discontinuity), scored at bin i from window sums over x_j and y_j with window sizes n_x and n_y; example term labels: kaposi, norovirus, sirnas, regimen, outbreak, polymorphisms]

Emergence Algorithms
x_t = # documents containing term
N_t = # documents within time interval

Chi-Square (Pearson) model, with the 2x2 table (n_{11}, n_{12}; n_{21}, n_{22}) formed from n_y \bar{y}_t, m \bar{x}_t, and N_t:
\[ \chi^2_Y = \frac{n_{..} \left( |n_{11} n_{22} - n_{12} n_{21}| - n_{..}/2 \right)^2}{n_{1.} n_{2.} n_{.1} n_{.2}} \]
where
n_{1.} = n_{11} + n_{12}, n_{2.} = n_{21} + n_{22},
n_{.1} = n_{11} + n_{21}, n_{.2} = n_{12} + n_{22},
n_{..} = n_{11} + n_{12} + n_{21} + n_{22}

Gaussian model: statistic G_t computed from the window means \bar{x}_t and \bar{y}_t, the standard deviations s_x and s_y, and the window sizes n_x and n_y.
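The emergence models compare two adjacent windows of bins rather than one bin against its history. The sketch below shows one way such a comparison could work, in Python. It is an assumption-laden illustration: the exact Gaussian expression is not recoverable from the slide, so a standard two-sample form (difference of window means over sqrt(s_x^2/n_x + s_y^2/n_y)) is swapped in, and the function names and the sliding-window layout are hypothetical.

```python
import math
from statistics import mean, variance


def emergence_statistic(previous, current):
    """Two-sample statistic comparing the current window with the previous one.

    Assumed form: (mean(current) - mean(previous)) / sqrt(s_x^2/n_x + s_y^2/n_y).
    """
    n_x, n_y = len(previous), len(current)
    if n_x < 2 or n_y < 2:
        raise ValueError("each window needs at least two bins")
    diff = mean(current) - mean(previous)
    se = math.sqrt(variance(previous) / n_x + variance(current) / n_y)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def max_emergence(counts, n_prev, n_cur):
    """Slide both windows along a term's temporal profile; return the maximum score.

    `counts` is the per-bin document count for one term. Returns None when
    the profile is shorter than the two windows combined.
    """
    best = None
    for end in range(n_prev + n_cur, len(counts) + 1):
        current = counts[end - n_cur:end]
        previous = counts[end - n_cur - n_prev:end - n_cur]
        score = emergence_statistic(previous, current)
        if best is None or score > best:
            best = score
    return best
```

Ranking terms by `max_emergence` would give a list comparable in spirit to the "max Emergence" temporal profiles, with the two window lengths playing the role of the previous and current windows.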
Sensitivity Analysis
- Two data (text) sources
  - Journal of Virology: 40,306 documents
  - BioTechniques journal: 5,693 documents
- Two Emergence algorithms
  - Chi-Square (Pearson) method
  - Gaussian
- Bin sizes
  - Time interval
  - Previous window
  - Current window
[Figure: J. Virology, # Docs vs. Time, 02/00 to 05/08, time interval = 1 month]
[Figure: BioTechniques, # Docs vs. Time, 11/91 to 05/08, time interval = 2 months]

Emergence Analysis Results
Virology Dataset
[Figure: Top 30 Emergent Terms; temporal profiles, max Emergence stat, bin.width = 1 month, # bins = (12, 18), PubMedViro dataset, 02/00 to 05/08]
[Figure: Sorted Emergence Scores; Emergence Stat vs. Record; Chi-Square, Pearson, emerge.window = (12, 18), bin.size = 1 month]
Terms listed on the score plot:
cov, sars, coronavirus, syndrome, tt, antiretroviral, respiratory, res, acute, ttv, sclerosis, patients, therapy, drug, chronic, r5, sirna, x4, h5n1, ms, immunodeficiency, transmission, haart, gfp, influenza, kda, ribavirin, trim5alpha, activation, lymphocytes, utr, load, subjects, herpesvirus, samples, m, rnai, wnv, promoter, proteasome, lamivudine, prevalence, killer, kaposi, norovirus, sirnas, liver, apobec3, cns, synthesis, fusion, interferon, plant, binding, baseline, porcine, dna, children, rabies, h1n1, huh7, receptor, south, nucleoside, women, helper, jc, kappa, de, africa, chicken

BioTechniques Results
Comparison to Domain Expert Predictions
Top 60 Emergent Terms (sorted over time):
polymerase chain reaction, chain, reaction, concentration, pcr, isolation, polymerase, purification, gel, rapid, rt, reproducible, dna, assay, mammalian, accurate, size, green, human, gfp, activity,
hybridization, simple, system, method, fragments, genetic, amount, transfection, gene, developed, analysis, sensitive, signal, microarray, tissue, discovery, methods, amounts, applications, array, microarrays, throughput, genomic, improved, produced, fluorescent, escherichia, biology, independent, strategy, therapeutic, protein, drug, technique, quantitative, fluorescence, cells, containing, demonstrated
[Figure: temporal profiles of the top 60 terms, 11/91 to 05/08]
Domain expert predictions:
- 1990: polymerase chain reaction
- 1994: reverse transcriptase
- 1997: green fluorescent protein
- 2000-2002: microarrays and high-throughput platforms

Sensitivity Analysis Results
Qualitative Comparison
Common terms across sensitivity analysis

Algorithm/parameter                      Virology   BioTech
Algorithm (Surprise, Emergence)          20, 12     7, 16
Bin size (Surprise, Emergence)           18, 24     4, 24
Previous Window (Surprise, Emergence)    18, 9      8, 2
Current Window (Emergence)               10         9

- The table shows the results of comparing the top terms from each analysis
- Results from the Virology dataset tended to be more consistent
- Results were typically more consistent when the time intervals were largest
- Results from the Surprise algorithms were typically more consistent than those from the Emergence algorithms

Current Research
- Identify possible changing (morphing) terms utilizing multi-term keywords and decoupling temporal profiles
  [Figure: temporal profiles for polymerase chain reaction, chain, reaction, polymerase, pcr]
- Identify changes in tone and affect
- Investigate temporal correlation vs.
  topical correlation

Decoupling Terms
Decoupling issue: single terms (bush, president) overlap with multi-term keywords (george bush, president bush, united states)
[Figure: temporal profiles, max Emergence stat, bin.width = 1 week, # bins = (4, 4), MPQA dataset, 06/01 to 04/02. Profiles: george bush, president bush, president, united states, bush − president bush (13), president − president bush (22), president − bush; value labels 18, 8, 26, and 17 also appear on the plot]
[Figure: Related Terms panels. Terms shown: foreign, minister, united states, country, countries, united, union, national, washington, describe, presidential, power, supporters, opposition, violence, iraq, elected, words, elections, axis, democratic, evil, policy, korea]

Identifying Changes in Tone and Affect
[Figure: temporal and affect profiles, max Emergence, bin.width = 1 week, # bins = (4, 4), MPQA dataset, 06/01 to 04/02. Two panels: specific term affect and all document affect. Terms: coup, zimbabwe, robert, election, tsvangirai, mugabe, korea, each with Pos Affect and Neg Affect values]

Conclusions
- Modified previous Surprise event detection technology for detecting emerging technologies (trends)
- Utilized domain expertise within the iterative development process and analyses
- Performed several analyses, including sensitivity analysis
- Results were confirmed by domain experts as actual emerging technologies
- Continued development (current research)
  - Multi-term keywords and decoupling temporal profiles
  - Identify changes in tone and affect
  - Temporal correlation vs. topical correlation
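The profile labels on the Decoupling Terms slide ("bush − president bush", "president − president bush") suggest that decoupling subtracts a multi-term keyword's temporal profile from a single term's profile. The Python sketch below illustrates that reading only; the function names, the whitespace tokenization, and the (bin index, text) document representation are all hypothetical and not taken from the slides.

```python
from collections import Counter


def contains_phrase(tokens, phrase):
    """True if the word sequence `phrase` occurs contiguously in `tokens`."""
    k = len(phrase)
    return any(tokens[i:i + k] == phrase for i in range(len(tokens) - k + 1))


def temporal_profile(docs, phrase):
    """Count, per time bin, the documents that contain `phrase`.

    `docs` is an iterable of (bin_index, text) pairs (assumed layout).
    """
    words = phrase.lower().split()
    profile = Counter()
    for bin_index, text in docs:
        if contains_phrase(text.lower().split(), words):
            profile[bin_index] += 1
    return profile


def decoupled_profile(docs, term, keyword):
    """Profile of `term` minus the profile of the multi-term `keyword`.

    Assumes `term` is part of `keyword`, so every document counted for the
    keyword is also counted for the term and the difference is never negative.
    """
    term_profile = temporal_profile(docs, term)
    keyword_profile = temporal_profile(docs, keyword)
    return {b: term_profile[b] - keyword_profile[b] for b in sorted(term_profile)}
```

With weekly bin indices, `decoupled_profile(docs, "bush", "president bush")` would give the per-week count of documents that mention "bush" without the phrase "president bush", which could then be scored with the same emergence statistics as any other profile.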