Mining for Emerging Technologies Within Text Streams and Documents

Dave Engel
(dave.engel@pnl.gov)
Paul Whitney
Gus Calapristi
Fred Brockman
Text Mining Workshop (SDM09)
Outline
Research objective
Surprising event detection (poster session)
Emerging technologies detection
Analysis
Future research
Objective
“Our research and development are targeted at
developing algorithms to find and characterize
changes in topic (technologies) within text
(streams and documents)”
Surprise Event Detection
Topical Feature Extraction
Collection of Text/Data
Document Frequencies
[Figure: temporal profiles, max Emergence, bin.width=1 month, # bins=(12, 18), Virology dataset; time axis 02/00 to 05/08]
Related Terms/Topics
Identified Surprising Events
[Figure: temporal profiles, max Surprise, bin.width=1 month, # bins=12, Virology dataset; time axis 02/00 to 05/08]
Surprise Algorithms: Gaussian Model, Chi-Square (Pearson) Model
Text Mining Analysis
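The Document Frequencies step feeds every later statistic, so a small illustration may help. The following is a minimal Python sketch, assuming each document is a (timestamp, text) pair and using a naive whitespace tokenizer. The function name `document_frequencies`, the sample documents, and the fixed-width binning are illustrative assumptions, not the pipeline used in the talk. It produces x (# documents containing a term, per time bin) and N (# documents per time bin).

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def document_frequencies(docs, start, bin_width):
    """Return (x, N): x[term][b] = # documents in bin b containing term,
    N[b] = # documents in bin b. Bins are fixed-width intervals from `start`."""
    x = defaultdict(Counter)
    N = Counter()
    for timestamp, text in docs:
        b = int((timestamp - start) / bin_width)  # index of the time bin
        N[b] += 1
        # Using a set counts each term at most once per document.
        for term in set(text.lower().split()):
            x[term][b] += 1
    return x, N


# Hypothetical documents, invented for illustration only.
docs = [
    (datetime(2003, 3, 1), "SARS coronavirus outbreak"),
    (datetime(2003, 3, 2), "sars sars respiratory syndrome"),
    (datetime(2003, 3, 9), "acute respiratory syndrome"),
]
x, N = document_frequencies(docs, datetime(2003, 3, 1), timedelta(days=7))
print(dict(x["sars"]), dict(N))
```

Counting documents rather than raw occurrences keeps a single long document from dominating a bin, which matches the definition of x as a document count.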
Surprise Modeling Scheme
Surprising/Emerging Events
Point Discontinuity
Jump Discontinuity
Slope Discontinuity
Modeling Scheme
[Figure: modeling scheme relating the current bin count $x_i$ to the windowed sum $\sum_j x_j$ taken over a window of $n$ bins]
Surprise Algorithms
Gaussian Model
$$G = \frac{x_t - \bar{x}_t}{s\,\sqrt{1 + 1/n}}$$
Chi-Square (Pearson) Model
$$\begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} = \begin{pmatrix} x_t & N_t - x_t \\ m\bar{x}_t & N_t - m\bar{x}_t \end{pmatrix}$$
$$\chi^2 = \frac{n_{..}\left(\lvert n_{11} n_{22} - n_{12} n_{21} \rvert - Y\,\frac{n_{..}}{2}\right)^2}{n_{1.}\, n_{2.}\, n_{.1}\, n_{.2}}$$
where
$$n_{1.} = n_{11} + n_{12}, \quad n_{2.} = n_{21} + n_{22}, \quad n_{.1} = n_{11} + n_{21}, \quad n_{.2} = n_{12} + n_{22}, \quad n_{..} = n_{11} + n_{12} + n_{21} + n_{22}$$
$x_t$ = # documents containing term
$N_t$ = # documents within time interval
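As an illustration of the two surprise statistics, here is a minimal Python sketch. It assumes that the Gaussian statistic compares the current count with the mean and sample standard deviation of the n previous bins, G = (x_t - mean) / (s * sqrt(1 + 1/n)), and that the chi-square statistic is Pearson's 2x2 statistic with the correction switch Y (Y = 1 applies Yates' continuity correction). The function names, the example counts, and the reading of the second table row as previous-window totals are assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def gaussian_surprise(history, current):
    """G = (x_t - mean) / (s * sqrt(1 + 1/n)) over the n previous bins."""
    n = len(history)
    s = stdev(history)  # sample standard deviation of the previous window
    return (current - mean(history)) / (s * math.sqrt(1 + 1 / n))


def chi_square_2x2(n11, n12, n21, n22, yates=True):
    """Pearson chi-square for a 2x2 table; Y = 1 applies the correction.

    Follows the formula as written on the slide; it does not clamp the
    corrected difference at zero, which some implementations do.
    """
    total = n11 + n12 + n21 + n22
    y = 1 if yates else 0
    num = total * (abs(n11 * n22 - n12 * n21) - y * total / 2) ** 2
    den = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return num / den


# Hypothetical counts: a term seen in 2-3 documents per bin, then in 9.
g = gaussian_surprise([2, 3, 2, 3, 2, 3], 9)

# Hypothetical table: 9 of 50 current documents contain the term,
# versus 15 of 300 documents in the previous window.
chi = chi_square_2x2(9, 50 - 9, 15, 300 - 15)
print(round(g, 3), round(chi, 3))
```

Both statistics grow when the current bin departs from the previous window; the chi-square form also accounts for changes in the total number of documents per interval.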
Surprise GUI
Analysis Tools
VofA dataset, 834 Documents
[Figure: # Docs vs. Time, Time Interval = 12 hours; time axis 11/23 to 12/18]
[Figure: temporal profiles, max Surprise, bin.width=12 hours, # bins=6, VofA dataset; time axis 11/23 to 12/18]
[Figure: sorted Surprise Stat vs. Record; Gaussian, surprise.window = 6, bin.size = 12 hours]
Term labels on the sorted plot:
forces, top, services
robert
baker, combat, held, council, iraqi, bipartisan, discuss
provide
summit, recommendations, european, hamilton, lee, following
expressed, opposition, press
troops, world, administration, syria, east, file
human, control, police, efforts
comes, intelligence, issues, nbsp, hopes
situation, united, law, policy, changes, agree, remains
presidential, organization, bush, border, west, speech, led, north, late, regional
african, progress, direct, republican, protect, defense, current, officials, court, james, cri
nato, mission, british, talk, violent, amid, ally, remain, foreign, house, studio, terrorism,
military, israel, capital, leaders, real, parliament, committee, army
Emerging Technologies Detection
Process Flow
1. Source Data Selection: gather multiple data sources/types
2. Topical Feature Extraction: process with IN-SPIRE to evaluate suitability of the data content, find topics for emergence measurement, prep for Surprise! analysis
3. Emergence Algorithm Development: event detection algorithms modified and enhanced per biologists' feedback
4. Domain Expert Review & Evaluation: biologist and analyst review to validate findings from emergence algorithms
Iterate, Enhance, Refine
Emerging Technologies Identified
Emergence Modeling Scheme
Emerging Events
Jump Discontinuity
Slope Discontinuity
Modeling Scheme
[Figure: modeling scheme comparing the windowed sum $\sum_j x_j$ over $n_x$ bins with the windowed sum $\sum_j y_j$ over $n_y$ bins]
Emergence Algorithms
Gaussian Model
$$G_t = \frac{\bar{x}_t - \bar{y}_t}{\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}}$$
Chi-Square (Pearson) Model
$$\begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} = \begin{pmatrix} m\bar{x}_t & N_t - m\bar{x}_t \\ n\bar{y}_t & N_t - n\bar{y}_t \end{pmatrix}$$
$$\chi^2 = \frac{n_{..}\left(\lvert n_{11} n_{22} - n_{12} n_{21} \rvert - Y\,\frac{n_{..}}{2}\right)^2}{n_{1.}\, n_{2.}\, n_{.1}\, n_{.2}}$$
where
$$n_{1.} = n_{11} + n_{12}, \quad n_{2.} = n_{21} + n_{22}, \quad n_{.1} = n_{11} + n_{21}, \quad n_{.2} = n_{12} + n_{22}, \quad n_{..} = n_{11} + n_{12} + n_{21} + n_{22}$$
$x_t$ = # documents containing term
$N_t$ = # documents within time interval
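To make the two-window idea concrete, here is a minimal Python sketch of a Gaussian-style emergence score. It assumes the statistic is the two-sample form (x_bar - y_bar) / sqrt(s_x^2/n_x + s_y^2/n_y), with x taken from a previous window of n_x bins and y from a current window of n_y bins. The function names, the example series, and the choice to rank by the negated statistic (so that rising terms score positive) are assumptions for illustration, not details confirmed by the slides.

```python
import math
from statistics import mean, variance


def gaussian_emergence(previous, current):
    """Two-sample statistic (x_bar - y_bar) / sqrt(s_x^2/n_x + s_y^2/n_y)."""
    diff = mean(previous) - mean(current)
    se = math.sqrt(variance(previous) / len(previous)
                   + variance(current) / len(current))
    if se == 0:
        # Both windows are constant: no spread to scale the difference by.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def max_emergence(counts, n_x, n_y):
    """Slide both windows along the series; return (bin index, best score).

    The score is the negated statistic, so an increase from the previous
    window to the current window is positive (an assumption of this sketch).
    """
    best = None
    for i in range(n_x, len(counts) - n_y + 1):
        previous, current = counts[i - n_x:i], counts[i:i + n_y]
        score = -gaussian_emergence(previous, current)
        if best is None or score > best[1]:
            best = (i, score)
    return best


# Hypothetical per-bin document counts for one term.
counts = [1, 2, 1, 2, 1, 2, 8, 9, 10, 9]
best = max_emergence(counts, n_x=4, n_y=3)
print(best)
```

Taking the maximum over all window positions corresponds to the "max Emergence" label used in the figure captions, under the assumption that the label refers to the largest score a term attains over time.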
Sensitivity Analysis
Two data (text) sources
  Journal of Virology: 40,306 Documents
  BioTechniques journal: 5,693 Documents
Two Emergence algorithms
  Chi-Square (Pearson) method
  Gaussian
Bin sizes
  Time interval
  Previous window
  Current Window
[Figure: J. Virology, # Docs vs. Time, Time Interval = 1 month; time axis 02/00 to 05/08]
[Figure: BioTechniques, # Docs vs. Time, Time Interval = 2 months; time axis 11/91 to 05/08]
Emergence Analysis Results
Virology Dataset
Top 30 Emergent Terms
[Figure: temporal profiles, max Emergence stat, bin.width = 1 month, # bins = (12, 18), PubMedViro dataset; time axis 02/00 to 05/08]
Sorted Emergence Scores
[Figure: sorted Emergence Stat vs. Record; Chi-Square, Pearson, emerge.window = (12, 18), bin.size = 1 month]
Term labels on the sorted plot:
cov, sars
coronavirus
syndrome
tt, antiretroviral, respiratory, res
acute, ttv
sclerosis, patients, therapy
drug, chronic, r5, sirna
x4, h5n1, ms, immunodeficiency, transmission
haart, gfp, influenza, kda
ribavirin, trim5alpha, activation, lymphocytes, utr, load, subjects, herpesvirus, samples, m
rnai, wnv, promoter, proteasome, lamivudine, prevalence, killer, kaposi, norovirus, sirnas
liver, apobec3, cns, synthesis, fusion, interferon, plant, binding, baseline, porcine, dna,
children, rabies, h1n1, huh7, receptor, south, nucleoside, women, helper, jc, kappa, de
africa, chicken
BioTechniques Results
Comparison to Domain Expert Predictions
Top 60 Emergent Terms (sorted over time)
[Figure: temporal profiles of the top emergent terms in two panels; time axis 11/91 to 05/08]
Terms (first panel): chain, reaction, concentration, pcr, isolation, polymerase, purification, gel, rapid, rt, reproducible, dna, assay, mammalian, accurate, size, green, human, gfp, activity, hybridization, simple, system, method, fragments, genetic, amount, transfection, gene
Terms (second panel): developed, analysis, sensitive, signal, microarray, tissue, discovery, methods, amounts, applications, array, microarrays, throughput, genomic, improved, produced, fluorescent, escherichia, biology, independent, strategy, therapeutic, protein, drug, technique, quantitative, fluorescence, cells, containing, demonstrated
Domain expert predictions:
1990 polymerase chain reaction
1994 reverse transcriptase
1997 green fluorescent protein
2000-2002 microarrays and high-throughput platforms
Sensitivity Analysis Results
Qualitative Comparison
Common terms across sensitivity analysis

Algorithm/parameter                     Virology   BioTech
Algorithm (Surprise, Emergence)         20, 12     7, 16
Bin size (Surprise, Emergence)          18, 24     4, 24
Previous Window (Surprise, Emergence)   18, 9      8, 2
Current Window (Emergence)              10         9

The table compares the top terms from each analysis and reports how many they have in common
Results from the Virology dataset tended to be more consistent
Results were typically more consistent when the time intervals were largest
Results from the Surprise algorithms were typically more consistent than those from the Emergence algorithms
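One way to produce a "common terms" count like those in the table is to take the top-k terms from two runs and count the overlap. The following is a minimal Python sketch under that assumption; the function names, the value of k, and the scores are invented for illustration and are not taken from the analysis itself.

```python
def top_terms(scores, k):
    """Return the k highest-scoring terms as a set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def common_term_count(scores_a, scores_b, k=30):
    """Number of terms shared by the top-k lists of two analysis runs."""
    return len(top_terms(scores_a, k) & top_terms(scores_b, k))


# Hypothetical scores from two runs that differ in one parameter.
run_a = {"cov": 60.0, "sars": 58.0, "coronavirus": 40.0, "tt": 30.0}
run_b = {"sars": 12.0, "cov": 11.0, "acute": 9.0, "tt": 2.0}
shared = common_term_count(run_a, run_b, k=3)
print(shared)
```

Comparing sets rather than ranked lists ignores the order within the top k, which keeps the measure insensitive to small score differences between runs.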
Current Research
Identify possible changing (morphing) terms utilizing multi-term keywords and decoupling temporal profiles
[Figure: temporal profiles for polymerase chain reaction, chain, reaction, polymerase, pcr]
Identify changes in tone and affect
Investigate temporal correlation vs. topical correlation
Decoupling Terms
Decoupling
[Figure: temporal profiles, max Emergence stat, bin.width = 1 week, # bins = (4, 4), MPQA dataset; time axis 06/01 to 04/02. Profiles shown: bush, president bush, president, bush − president bush, president − president bush, | president − bush |]
Related Terms
[Figure: related terms for the profiles above, including issue, george, policy, united states, foreign minister, countries, describe, washington, country, united, union, national, presidential, power, supporters, opposition, violence, iraq, elected, words, elections, axis, democratic, evil, korea]
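The decoupled profiles named on this slide (bush − president bush, president − president bush, | president − bush |) can be read as per-bin differences between temporal profiles. Here is a minimal Python sketch under that reading; the function names and the weekly counts are invented for illustration, and the slides do not confirm that the differences are taken on raw document counts.

```python
def decouple(term_profile, phrase_profile):
    """Per-bin difference, e.g. 'bush' minus 'president bush'."""
    return [t - p for t, p in zip(term_profile, phrase_profile)]


def abs_difference(profile_a, profile_b):
    """Per-bin absolute difference, e.g. | president - bush |."""
    return [abs(a - b) for a, b in zip(profile_a, profile_b)]


# Hypothetical weekly document counts for each keyword.
bush = [5, 6, 9, 14]
president = [7, 7, 8, 10]
president_bush = [3, 4, 6, 9]

bush_only = decouple(bush, president_bush)
president_only = decouple(president, president_bush)
gap = abs_difference(president, bush)
print(bush_only, president_only, gap)
```

Subtracting the multi-term profile separates the uses of a single term that belong to the phrase from those that do not, so each residual profile can then be scored for emergence on its own.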
Identifying Changes in Tone and Affect
[Figure: temporal and affect profiles, max Emergence, bin.width = 1 week, # bins = (4,4), MPQA dataset; time axis 06/01 to 04/02. Panels: All Terms, coup, zimbabwe, robert, election, tsvangirai, mugabe, korea, each with Pos Affect and Neg Affect profiles. Legend: specific term affect, all document affect]
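The legend distinguishes affect for documents containing a specific term from affect over all documents. A minimal Python sketch of that comparison follows. The slides do not say how affect is measured, so the tiny word lists, the token-fraction score, and the function names are all assumptions made for illustration.

```python
POSITIVE = {"peace", "agree", "support"}    # hypothetical lexicon
NEGATIVE = {"violence", "coup", "crisis"}   # hypothetical lexicon


def affect(docs):
    """Return (pos, neg): fraction of tokens found in each lexicon."""
    tokens = [w for d in docs for w in d.lower().split()]
    if not tokens:
        return 0.0, 0.0
    pos = sum(w in POSITIVE for w in tokens) / len(tokens)
    neg = sum(w in NEGATIVE for w in tokens) / len(tokens)
    return pos, neg


def affect_profiles(bins, term):
    """For each time bin, affect of docs containing `term` vs. all docs."""
    profile = []
    for docs in bins:
        with_term = [d for d in docs if term in d.lower().split()]
        profile.append({"term": affect(with_term), "all": affect(docs)})
    return profile


# Hypothetical documents grouped into two time bins.
bins = [
    ["leaders agree on peace", "mugabe faces crisis"],
    ["mugabe election violence", "support for election"],
]
profile = affect_profiles(bins, "mugabe")
print(profile)
```

Plotting the term-specific values against the all-document values per bin gives the kind of paired profile described by the legend, so a shift in tone around one term can be told apart from a shift in the whole collection.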
Conclusions
Modified previous Surprise event detection technology for detecting emerging technologies (trends)
Utilized domain expertise within the iterative development process and analyses
Performed several analyses, including a sensitivity analysis
Results were confirmed by domain experts as actual emerging technologies
Continued development (current research)
  Multi-term keywords and decoupling temporal profiles
  Identify changes in tone and affect
  Temporal correlation vs. topical correlation