MINING RESEARCH CYCLES WITH ADAPTED HIERARCHICAL CLUSTERING
Dan He, D.S. Parker
danhe@cs.ucla.edu
TMW 2011

BACKGROUND
- A great deal of attention has been focused on understanding the cycle of news topics (Kleinberg, 2003; Leskovec, 2009).
- The shifts of news topics reveal how the news evolved and how people’s attention shifted.
- We are interested in “research cycles”, namely the ebb and flow (more specifically, the coming and going) of topics in the research literature.

BENEFITS OF IDENTIFYING RESEARCH CYCLES
- Studying the pattern of research cycles helps to predict which topics are becoming popular or will become popular in the near future. This is helpful for funding agencies when granting awards. He & Parker [PAKDD 2011] study topic trends in NIH’s RePORTER grant award database.
- Google Trends is a system for identifying temporal patterns in topic histories, and cycles are important in evaluating its predictive capabilities.
- There seem to be many potential benefits in the area of analytics and predictive models for social media.

WHAT ARE CYCLES?
- There are multiple definitions of cycles.
- As an example, stock prices are known to exhibit cyclical patterns.
- The canonical cycle of stock prices consists of four phases: accumulation, mark-up, distribution and mark-down.

WHAT ARE RESEARCH CYCLES?
- We will give our definition of research cycles later.
- Our model of a cycle is actually not like a stock market cycle: it is a graph cycle. We look for sequences of topics that recur, concluding a cycle.
- There is no previous work that quantifies research cycles.

RESEARCH TOPICS VS. NEWS TOPICS
                      News Topics      Research Topics
Number of topics      Large            Small
Time scale            Minutes, days    Months, years
Trend of frequency    Sharp bursts     Flat bursts

RESEARCH TOPIC OCCURRENCES
- Identify research topics in publications.
- Publications at different times indicate temporal occurrences of the corresponding research topics.
- Publications:
  - Publications in conferences
  - Publications in journals

PUBLICATIONS IN CONFERENCES
- Papers are organized in sessions.
- Papers about related topics are assigned to the same session.
- The same session may recur multiple times.
  - The session “Query Optimization” appeared in SIGMOD in the years 1983 and 1985.
- A session is a natural cluster of a topic.
- Quantify topic-wise similarity:
  - Jaccard similarity sim(A, B), based on the common terms of the paper titles in sessions A and B.
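The Jaccard similarity of two term sets is the size of their intersection divided by the size of their union. A minimal sketch of how the session similarity might be computed from paper titles follows. The tokenization (lowercase alphanumeric words, no stop-word removal or stemming) and the function names `title_terms` and `jaccard` are assumptions made for illustration; the slides do not specify them. The titles are taken from the example sessions shown later in the deck.

```python
import re

def title_terms(titles):
    """Return the set of lowercase word terms in a list of paper titles.

    Assumption: plain word tokenization; the talk does not say how terms
    were extracted (stop words, stemming, and so on).
    """
    terms = set()
    for title in titles:
        terms.update(re.findall(r"[a-z0-9]+", title.lower()))
    return terms

def jaccard(terms_a, terms_b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 if both sets are empty."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)

# Two sessions, each represented by some of its paper titles.
query_optimization_1983 = title_terms([
    "On the Design of a Query Processing Strategy in a Distributed Database Environment",
    "Estimating Block Transfers and Join Sizes",
])
distributed_query_processing_1984 = title_terms([
    "Optimization of Nested Queries in a Distributed Relational Database",
    "Optimizing Star Queries in a Distributed Database System",
])

similarity = jaccard(query_optimization_1983, distributed_query_processing_1984)
print(f"sim = {similarity:.3f}")
```

Without stop-word removal, words such as "of" and "in" count as shared terms, so a real pipeline would probably filter them before comparing sessions.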
SHIFTS OF TOPICS
- Two sessions are related if their similarity exceeds a threshold $t$.
- A research cycle for topic $A$ is a path of occurrences over a sequence of $n+1$ consecutive years:
  $A_0(m),\ A_1(m+1),\ A_2(m+2),\ \ldots,\ A_n(m+n)$
  - $A_i(m)$ is the $i$-th topic, occurring in year $m$.
  - $A_0 = A_n$ are the same topic: $A_0(m)$ and $A_n(m+n)$ are the same topic occurring in years $m$ and $m+n$ respectively.
  - $\mathrm{sim}(A_i(m+i),\ A_{i+1}(m+i+1)) > t$ for each $i = 0, \ldots, n-1$.
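- Here $\mathrm{sim}(A, B) = \dfrac{|T(A) \cap T(B)|}{|T(A) \cup T(B)|}$ is the Jaccard similarity, where $T(X)$ is the set of terms of session $X$.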
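The definition of a research cycle can be read as three checks on a candidate path: the first and last occurrences are the same topic, the years are consecutive, and every adjacent pair is more similar than the threshold. The sketch below encodes that reading. The function name `is_research_cycle`, the representation of an occurrence as a `(topic, year)` pair, and the similarity values in `toy_sim` are all assumptions made up for illustration.

```python
def is_research_cycle(path, sim, t):
    """Check a path [A_0(m), A_1(m+1), ..., A_n(m+n)] against the definition.

    path: list of (topic, year) occurrences.
    sim:  function taking two occurrences and returning their similarity.
    t:    similarity threshold.
    """
    if len(path) < 2:
        return False
    first_topic, first_year = path[0]
    last_topic, _ = path[-1]
    if first_topic != last_topic:              # A_0 and A_n are the same topic
        return False
    for i, (_, year) in enumerate(path):       # years m, m+1, ..., m+n
        if year != first_year + i:
            return False
    # every consecutive pair must be related: sim > t
    return all(sim(a, b) > t for a, b in zip(path, path[1:]))

# Illustrative similarity values (not measured from real data).
toy_sim = {
    (("Query Optimization", 1983), ("Distributed Query Processing", 1984)): 0.40,
    (("Distributed Query Processing", 1984), ("Query Optimization", 1985)): 0.35,
}

def sim(a, b):
    return toy_sim.get((a, b), 0.0)

path = [
    ("Query Optimization", 1983),
    ("Distributed Query Processing", 1984),
    ("Query Optimization", 1985),
]
print(is_research_cycle(path, sim, t=0.3))
```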
SHIFTS OF TOPICS
- Chain-of-shifts:
  - We assume that research attention often shifts from one topic to another related topic (social network -> graph mining/text mining).
  - We can identify chains of topics between two occurrences of the same topic.
- Example of topic shifts:
  - Query Optimization (1983) -> Distributed Query Processing (1984) -> Query Optimization (1985)
IDENTIFYING RESEARCH CYCLES
- We can build a graph G = (V, E) in which each topic in a year is a node, and each edge indicates that the correlation between its pair of topics is greater than the threshold.
- Since the same topic may occur in different years, the graph may contain multiple nodes for the same topic, one for each year in which it occurs.
- We identify all paths between two nodes for the same topic in the graph.
[Figure: example graph; node labels include A(1), A(10), A(12)]
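The graph construction described above might look like the following sketch. It assumes, in line with the cycle definition over consecutive years, that an edge links a topic in year y only to related topics in year y+1; the slides do not state this restriction explicitly. The function name `build_topic_graph` and the small term sets are hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_topic_graph(sessions, t):
    """Build a directed graph over (topic, year) nodes.

    sessions: dict mapping (topic, year) -> set of terms.
    Assumption: an edge goes from a node in year y to a node in year y+1
    when their Jaccard similarity exceeds the threshold t.
    """
    by_year = defaultdict(list)
    for node in sessions:
        by_year[node[1]].append(node)

    graph = {node: [] for node in sessions}
    for node in sessions:
        for nxt in by_year.get(node[1] + 1, []):
            if jaccard(sessions[node], sessions[nxt]) > t:
                graph[node].append(nxt)
    return graph

# Made-up term sets, for illustration only.
sessions = {
    ("Query Optimization", 1983): {"query", "processing", "distributed", "join"},
    ("Distributed Query Processing", 1984): {"query", "distributed", "optimization", "join"},
    ("Concurrency Control", 1984): {"locking", "transaction"},
    ("Query Optimization", 1985): {"query", "optimization", "join", "plan"},
}
graph = build_topic_graph(sessions, t=0.5)
for node, successors in graph.items():
    print(node, "->", successors)
```

Because the same topic name appears with different years, "Query Optimization" contributes two distinct nodes here, which is what lets a path start and end on the same topic.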
ALGORITHMS
- A naïve algorithm:
  - Breadth-first search from every node.
  - Duplicate edges may be searched multiple times.
- An improved algorithm:
  - Depth-first search.
  - Once a path is found, trace back and, for each node on the path, record the path from that node to the destination.
  - The next time a node that has already been explored on the path is to be explored, we can obtain the remaining portion of the path immediately.
  - Advantage: avoids duplicate search of the same edges.
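One way to read the improved algorithm is as a depth-first search with memoization: the first time a node is explored, the paths from it to later occurrences of the target topic are recorded, and any later visit reuses them instead of walking the same edges again. The sketch below follows that reading and is not necessarily the authors' exact procedure. It assumes the graph is acyclic because edges point forward in time, and the names `paths_to_topic` and `find_cycles` are hypothetical.

```python
def paths_to_topic(graph, node, topic, memo):
    """Return all paths from `node` to any occurrence of `topic`.

    memo caches the result per (node, topic), so the edges below a node
    are searched only once for a given target topic.
    """
    key = (node, topic)
    if key in memo:
        return memo[key]
    paths = []
    if node[0] == topic:
        paths.append([node])                  # reached an occurrence of the topic
    for nxt in graph[node]:
        for tail in paths_to_topic(graph, nxt, topic, memo):
            paths.append([node] + tail)       # prepend this node while tracing back
    memo[key] = paths
    return paths

def find_cycles(graph):
    """Return every path that starts and ends on nodes of the same topic."""
    memo = {}
    cycles = []
    for start in graph:
        for nxt in graph[start]:
            for tail in paths_to_topic(graph, nxt, start[0], memo):
                cycles.append([start] + tail)
    return cycles

# Toy graph: nodes are (topic, year); edges point to the following year.
graph = {
    ("A", 1): [("B", 2), ("C", 2)],
    ("B", 2): [("A", 3)],
    ("C", 2): [("A", 3)],
    ("A", 3): [],
}
for cycle in find_cycles(graph):
    print(cycle)
```

In this toy graph the node ("A", 3) is reached through both ("B", 2) and ("C", 2), but its outgoing edges are explored only once per target topic.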


PUBLICATIONS IN JOURNALS
- NOT organized into sessions.
- NO explicit sessions of topics.
- To identify topics, use clustering or topic modeling:
  - It is hard to determine the number of clusters/topics. We want specific topics, so the number of topics will be big.
  - In our data set (PNAS, Proceedings of the National Academy of Sciences), there are around 70,000 papers over 25 years, so the number of topics could be hundreds to thousands.

AHC (ADAPTED HIERARCHICAL CLUSTERING) ALGORITHM
- Approximate the clustering process of conference publications: abstracts of the same year are organized into sessions.
- Two phases (bottom-up):
  - Phase 1: cluster the publications of the same year.
    - The number of publications in a single year is much smaller.
    - It is much easier to experimentally determine a reasonable cluster number.
    - The clusters are considered as topics.
  - Phase 2: take the clusters from Phase 1 as sessions, and conduct another clustering on the sessions.
    - Two sessions in the same year cannot be in the same cluster.
    - The sessions in the same cluster are considered as occurrences of the same topic in different years.
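Phase 2 can be pictured as an agglomerative clustering with one extra rule: two clusters that share a year are never merged. The sketch below is one possible realization under stated assumptions, not the authors' implementation. It assumes each Phase 1 session is summarized as a set of terms, uses Jaccard similarity with average linkage (the slides do not name a linkage), and stops at a target number of clusters `k`. The function name `phase2_cluster` and the toy sessions are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def phase2_cluster(sessions, k):
    """Greedy bottom-up clustering of Phase 1 sessions.

    sessions: list of (year, term_set) pairs.
    k:        target number of clusters.
    Returns a list of clusters, each a list of session indices.
    """
    clusters = [[i] for i in range(len(sessions))]

    def years(cluster):
        return {sessions[i][0] for i in cluster}

    def linkage(c1, c2):
        # Average pairwise similarity (an assumed linkage choice).
        sims = [jaccard(sessions[i][1], sessions[j][1]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            if years(clusters[a]) & years(clusters[b]):
                continue                      # same-year sessions must stay apart
            score = linkage(clusters[a], clusters[b])
            if best is None or score > best[0]:
                best = (score, a, b)
        if best is None:
            break                             # no admissible merge is left
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                       # b > a, so index a is unaffected
    return clusters

# Made-up sessions from two years.
sessions = [
    (1985, {"gene", "expression", "rna"}),
    (1985, {"cell", "human", "immune"}),
    (1986, {"gene", "expression", "protein"}),
    (1986, {"human", "cell", "tumor"}),
]
print(phase2_cluster(sessions, k=2))
```

Each resulting cluster then holds at most one session per year, so its members can be read as occurrences of one topic across years.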

IDENTIFYING RESEARCH CYCLES
- After the AHC algorithm is conducted, the same research cycle identification algorithm is applied.
- No session name is given. To represent the topic of each cluster, select the top-10 representative terms (by frequency/tf-idf).
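Selecting representative terms could be done along the following lines. This is a minimal sketch that assumes a cluster-level tf-idf: term frequency within the cluster, weighted by the log of the number of clusters divided by the number of clusters containing the term. The slides only say "frequency/tf-idf", so the exact weighting, the function name `top_terms`, and the toy tokens are assumptions.

```python
import math
from collections import Counter

def top_terms(clusters, n=10):
    """Return the n highest-scoring terms for each cluster.

    clusters: list of clusters; each cluster is a list of token lists
              (one token list per abstract).
    """
    counts = [Counter(tok for doc in cluster for tok in doc) for cluster in clusters]
    # Number of clusters in which each term appears.
    cluster_freq = Counter(term for c in counts for term in c)

    result = []
    for c in counts:
        scores = {
            term: tf * math.log(len(clusters) / cluster_freq[term])
            for term, tf in c.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:n])
    return result

# Made-up tokenized abstracts in two clusters.
clusters = [
    [["gene", "expression", "cell"], ["gene", "rna"]],
    [["cell", "human", "tumor"], ["human", "cell"]],
]
print(top_terms(clusters, n=2))
```

Terms that occur in every cluster, such as "cell" here, receive a score of zero and drop to the bottom of the ranking.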

EXPERIMENTAL RESULTS
- Conference publications:
  - Paper titles for VLDB and SIGMOD from 1975 to 2010.
  - Classic example initially studied by Kleinberg et al.
  - Merge equivalently-named sessions from the two conferences in the same year.
  - The number of unique sessions: around 1,128.
  - 99 sessions have multiple occurrences.
  - The average number of occurrences is about 3.6 for these 99 sessions.

CONFERENCE PUBLICATIONS
- Identified 94 cycles.
  - More “general” topics, manually labeled.
  - Multiple cycles for the same topic.
  - Long cycles.
- Example sessions, reflecting the shifts of interest among different aspects of the same general topic:
  - Query Optimization (SIGMOD 1983)
    - On the Design of a Query Processing Strategy in a Distributed Database Environment
    - Query Processing Utilizing Dependencies and Horizontal Decomposition
    - Estimating Block Transfers and Join Sizes
  - Distributed Query Processing (VLDB 1984)
    - Optimization of Nested Queries in a Distributed Relational Database
    - Processing Inequality Queries Based on Generalized Semi-Joins
    - Optimizing Star Queries in a Distributed Database System

JOURNAL PUBLICATIONS
- 73,520 abstracts in PNAS (Proceedings of the National Academy of Sciences) from 1985 to 2010.
- Conduct the AHC algorithm.
- Determine the cluster number:
  - Phase 1: the cluster number is set to 30.
    - We want to have specific topics: the bigger the cluster number, the more specific the topics.
    - We do not want to have too few publications in one session either.
    - Since the number of journal publications is much bigger, we should expect many more publications in one session.
    - 60-100 publications in one session.
  - Phase 2: the cluster number is set to 400.
    - As with the conference publications, we do not expect many sessions to have multiple occurrences.
    - 72 sessions have multiple occurrences.
    - The ratio of the cluster number, 400, to the number of multi-occurrence sessions, 72, is similar to that of the conference publications.

REPRESENTATIVE TERMS OF RANDOM CLUSTERS BY AHC PHASE 1
- Very different representative terms.

REPRESENTATIVE TERMS OF THE SAME CLUSTER BY AHC PHASE 2
- Similar representative terms.

SAMPLE CYCLES
- Identified 65 cycles.
- Multiple cycles for the same topic “human cell”.
- Multiple cycles for the same topic “gene expression”.

CONCLUSIONS & FUTURE WORK
- We proposed a model for research cycles.
- We proposed an efficient algorithm to identify all possible research cycles.
- We proposed an adapted hierarchical clustering algorithm to model research cycles in journal publications.
- The cycles may not necessarily be true. We need to further develop metrics to validate them.
- The clustering method needs theoretical analysis.

THANKS
- Questions?

TWO OPEN QUESTIONS
- Q1: What drives research cycles?
  - Generality (more general topics tend to have cycles more frequently)
  - Data (Social Network)
- Q2: What drives shifts of research cycles?
  - Are there any significant patterns in the shifts?
  - Long cycles, multiple cycles for the same topic, etc.