MINING RESEARCH CYCLES WITH ADAPTED HIERARCHICAL CLUSTERING
Dan He, D.S. Parker
danhe@cs.ucla.edu
TMW 2011

BACKGROUND
A great deal of attention has been focused on understanding the cycle of news topics (Kleinberg, 2003; J. Leskovec, 2009).
The shifts of news topics reveal how the news evolved and how people's attention shifted.
We are interested in "research cycles", namely the ebb and flow (more specifically, the coming and going) of topics in the research literature.

BENEFITS OF IDENTIFYING RESEARCH CYCLES
Studying the pattern of research cycles helps to predict which topics are becoming popular or will become popular in the near future.
It is helpful for funding agencies deciding on grant awards. He & Parker [PAKDD 2011] study topic trends in NIH's RePORTER grant award database.
Google Trends is a system for identifying temporal patterns in topic histories, and cycles are important in evaluating its predictive capabilities.
There seem to be many potential benefits in the area of analytics and predictive models for social media.

WHAT ARE CYCLES?
There are multiple definitions of cycles.
As an example, stock prices are known to exhibit cyclical patterns.
The canonical cycle of stock prices consists of four phases: accumulation, mark-up, distribution and mark-down.

WHAT ARE RESEARCH CYCLES?
We will give our definition of research cycles later.
Our model of a cycle is not like a stock market cycle: it is a graph cycle. We look for sequences of topics that recur, concluding a cycle.
No previous work quantifies research cycles.

RESEARCH TOPICS VS. NEWS TOPICS
                      News Topics      Research Topics
Number of topics      Large            Small
Time scale            Minutes, days    Months, years
Trend of frequency    Sharp bursts     Flat bursts
RESEARCH TOPIC OCCURRENCES
Identify research topics in publications.
Publications at different times indicate temporal occurrences of the corresponding research topics.
Publications: publications in conferences, publications in journals.

PUBLICATIONS IN CONFERENCES
Papers are organized in sessions. Papers about related topics are assigned to the same session.
The same session may recur multiple times. The session "Query Optimization" appeared in SIGMOD in the years 1983 and 1985.
A session is a natural cluster of a topic.
Quantify topic-wise similarity with the Jaccard similarity over the common terms of paper titles, where T(A) denotes the terms of session A:
$\mathrm{sim}(A, B) = \frac{|T(A) \cap T(B)|}{|T(A) \cup T(B)|}$

SHIFTS OF TOPICS
Two sessions are related if their similarity exceeds a threshold t.
A research cycle for topic A is a path of occurrences over a sequence of n+1 consecutive years:
$A_0(m), A_1(m+1), A_2(m+2), \ldots, A_n(m+n)$
$A_i(m)$ is the i-th topic, occurring at year m.
$A_0 = A_n$: $A_0(m)$ and $A_n(m+n)$ are the same topic occurring in years m and m+n respectively.
Consecutive occurrences are related: $\mathrm{sim}(A_i(m+i), A_{i+1}(m+i+1)) > t$

SHIFTS OF TOPICS
Chain-of-shifts: We assume that research attention often shifts from one topic to another related topic.
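The session similarity and the relatedness test defined on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`title_terms`, `jaccard`, `related`), the whitespace tokenization, and the threshold value 0.2 are hypothetical, and the slides do not say whether stopwords are removed. The titles are the ones the slides list for the two example sessions.

```python
def title_terms(titles):
    """Collect the set of lowercase terms T(A) from a session's paper titles.
    Assumption: plain whitespace tokenization, no stopword removal."""
    terms = set()
    for title in titles:
        terms.update(title.lower().split())
    return terms


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related(titles_a, titles_b, t):
    """Two sessions are related if their similarity exceeds the threshold t."""
    return jaccard(title_terms(titles_a), title_terms(titles_b)) > t


query_optimization_1983 = [
    "On the Design of a Query Processing Strategy in a Distributed Database Environment",
    "Query Processing Utilizing Dependencies and Horizontal Decomposition",
    "Estimating Block Transfers and Join Sizes",
]
distributed_query_processing_1984 = [
    "Optimization of Nested Queries in a Distributed Relational Database",
    "Processing Inequality Queries Based on Generalized Semi-Joins",
    "Optimizing Star Queries in a Distributed Database System",
]

sim = jaccard(title_terms(query_optimization_1983),
              title_terms(distributed_query_processing_1984))
# t = 0.2 is an arbitrary illustrative threshold, not a value from the slides.
is_related = related(query_optimization_1983, distributed_query_processing_1984, t=0.2)
```

With raw tokenization, common words such as "of" and "in" count as shared terms, so a practical version would likely filter them before comparing sessions.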
Example of related topics: social network graph mining / text mining.
We can identify chains of topics between two occurrences of the same topic.
Example of topic shifts: Query Optimization (1983) – Distributed Query Processing (1984) – Query Optimization (1985)

IDENTIFYING RESEARCH CYCLES
We can build a graph G = (V, E) in which each topic in a year is a node, and each edge indicates that the correlation between its pair of topics is greater than the threshold.
Since the same topic may occur in different years, the graph may contain multiple nodes for the same topic, one for each year in which it occurs.
We identify all paths between two nodes for the same topic in the graph.
[Figure: example graph with one node per topic and year, e.g. A(1), A(10), A(12)]

ALGORITHMS
A naïve algorithm: breadth-first search from every node. Duplicate edges may be searched multiple times.
An improved algorithm: depth-first search. Once a path is found, trace back and record the path for each node on the path. The next time a node that has already been explored on a path is to be explored, we can obtain the remaining portion of the path immediately.
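The graph construction and the path-recording depth-first search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the term sets, and the threshold are made up, and it assumes that edges only connect sessions in consecutive years (our reading of the cycle definition), which makes the graph acyclic in time and lets recorded paths be reused safely.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(sessions, t):
    """sessions maps (topic, year) -> set of title terms.
    Returns node -> list of next-year nodes whose similarity exceeds t."""
    by_year = {}
    for node in sessions:
        by_year.setdefault(node[1], []).append(node)
    graph = {node: [] for node in sessions}
    for (topic, year), terms in sessions.items():
        for nxt in by_year.get(year + 1, []):
            if jaccard(terms, sessions[nxt]) > t:
                graph[(topic, year)].append(nxt)
    return graph


def find_cycles(graph):
    """Return every path that starts at a topic and ends at a later
    occurrence of the same topic."""
    memo = {}  # (node, topic) -> recorded paths from node to later occurrences of topic

    def paths(node, topic):
        key = (node, topic)
        if key not in memo:  # otherwise reuse the recorded remainder of the path
            found = []
            for nxt in graph[node]:
                if nxt[0] == topic:
                    found.append([node, nxt])
                found.extend([node] + tail for tail in paths(nxt, topic))
            memo[key] = found
        return memo[key]

    cycles = []
    for start in graph:
        cycles.extend(paths(start, start[0]))
    return cycles


# Toy term sets, invented for illustration only.
sessions = {
    ("Query Optimization", 1983): {"query", "processing", "distributed", "join"},
    ("Distributed Query Processing", 1984): {"query", "processing", "distributed", "semi-join"},
    ("Storage", 1984): {"disk", "buffer", "index"},
    ("Query Optimization", 1985): {"query", "optimization", "distributed", "processing"},
}
cycles = find_cycles(build_graph(sessions, t=0.4))
```

On this toy input the only cycle found is Query Optimization (1983), Distributed Query Processing (1984), Query Optimization (1985). Note that the sketch also reports a topic that simply persists into the next year as a length-one path; whether such trivial cycles should be kept is not stated in the slides.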
ALGORITHMS
For each node on a found path, the recorded path is the path from that node to the destination.
Advantage: avoids duplicate search of the same edges.

PUBLICATIONS IN JOURNALS
NOT organized into sessions: there are NO explicit sessions of topics.
To identify topics, use clustering or topic modeling. It is hard to determine the cluster/topic number.
We want specific topics, so the number of topics will be big.
In our data set (PNAS, Proceedings of the National Academy of Sciences), there are around 70,000 papers over 25 years, so the number of topics could be in the hundreds to thousands.

AHC (ADAPTED HIERARCHICAL CLUSTERING) ALGORITHM
Approximate the clustering process of conference publications: abstracts of the same year are organized into sessions.
Two phases (bottom-up):
Phase 1: cluster the publications of the same year. The number of publications in a single year is much smaller, so it is much easier to experimentally determine a reasonable cluster number. The clusters are considered as topics.
Phase 2: take the clusters from Phase 1 as sessions and conduct another clustering on the sessions. Two sessions in the same year cannot be in the same cluster.
The sessions in the same cluster are considered as occurrences of the same topic in different years.

IDENTIFYING RESEARCH CYCLES
After the AHC algorithm is conducted, the same research cycle identification algorithm is applied.
No session name is given. To represent the topic, for each cluster, select the top-10 representative terms (frequency/tf-idf).

EXPERIMENTAL RESULTS
Conference publications

CONFERENCE PUBLICATIONS
Paper titles for VLDB and SIGMOD from year 1975 to 2010. Classic example initially studied by Kleinberg et al.
Merge equivalently-named sessions from the two conferences in the same year.
The number of unique sessions: around 1,128.
99 sessions have multiple occurrences. The average number of occurrences is about 3.6 for these 99 sessions.
Identified 94 cycles.
More "general" topics, manually labeled.
Example sessions on one cycle:
Query Optimization (SIGMOD 1983)
  On the Design of a Query Processing Strategy in a Distributed Database Environment
  Query Processing Utilizing Dependencies and Horizontal Decomposition
  Estimating Block Transfers and Join Sizes
Distributed Query Processing (VLDB 1984)
  Optimization of Nested Queries in a Distributed Relational Database
  Processing Inequality Queries Based on Generalized Semi-Joins
  Optimizing Star Queries in a Distributed Database System
The cycles reflect the shifts of interest among different aspects of the same general topic.
Multiple cycles for the same topic.
Long cycles.

JOURNAL PUBLICATIONS
73,520 abstracts in PNAS (Proceedings of the National Academy of Sciences) from year 1985 to 2010.
Conduct the AHC algorithm.
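The two-phase AHC procedure applied to the journal abstracts can be sketched roughly as below. This is a toy illustration under several assumptions that the slides do not confirm: clusters are compared by the Jaccard similarity of their pooled term sets, merging is greedy, and the names `merge_until` and `ahc`, the cluster numbers `k1` and `k2`, and the sample abstracts are all hypothetical. The cubic-time merge loop is only suitable for small inputs.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_until(clusters, k, allowed=lambda a, b: True):
    """Greedy bottom-up merging: join the most similar allowed pair until
    k clusters remain or no allowed pair is left."""
    clusters = list(clusters)
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if not allowed(clusters[i], clusters[j]):
                continue
            score = jaccard(clusters[i]["terms"], clusters[j]["terms"])
            if best is None or score > best[0]:
                best = (score, i, j)
        if best is None:
            break
        _, i, j = best
        a, b = clusters[i], clusters[j]
        merged = {"years": a["years"] | b["years"],
                  "terms": a["terms"] | b["terms"],
                  "papers": a["papers"] + b["papers"]}
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


def ahc(papers, k1, k2):
    """papers: iterable of (year, paper_id, abstract_text)."""
    by_year = {}
    for year, pid, text in papers:
        by_year.setdefault(year, []).append(
            {"years": {year}, "terms": set(text.lower().split()), "papers": [pid]})
    # Phase 1: cluster each year's publications into k1 "sessions".
    sessions = []
    for year in sorted(by_year):
        sessions.extend(merge_until(by_year[year], k1))
    # Phase 2: cluster the sessions; two sessions from the same year
    # may never end up in the same cluster.
    return merge_until(sessions, k2,
                       allowed=lambda a, b: not (a["years"] & b["years"]))


# Invented sample abstracts, for illustration only.
papers = [
    (1985, "p1", "gene expression in human cell"),
    (1985, "p2", "human cell gene expression regulation"),
    (1985, "p3", "protein structure folding"),
    (1985, "p4", "protein folding structure dynamics"),
    (1986, "p5", "gene expression human cell lines"),
    (1986, "p6", "gene expression in cell"),
    (1986, "p7", "protein structure folding kinetics"),
    (1986, "p8", "folding protein structure"),
]
topics = ahc(papers, k1=2, k2=2)
```

Each resulting topic records the years in which it occurs, so its per-year sessions can be fed to the same cycle identification step used for conference sessions.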
JOURNAL PUBLICATIONS
Determine the cluster number:
Phase 2: The cluster number is set to 400. As with the conference publications, we do not expect many sessions to have multiple occurrences. 72 sessions have multiple occurrences. The ratio of the cluster number (400) to the number of multi-occurrence sessions (72) is similar to that of the conference publications.
Phase 1: The cluster number is set to 30. We want specific topics: the bigger the cluster number, the more specific the topics. We do not want too few publications in one session either. Since the number of journal publications is much bigger, we should expect many more publications in one session: 60-100 publications in one session.

REPRESENTATIVE TERMS OF RANDOM CLUSTERS BY AHC PHASE 1
Very different representative terms.

REPRESENTATIVE TERMS OF THE SAME CLUSTER BY AHC PHASE 2
Similar representative terms.

SAMPLE CYCLES
Identified 65 cycles.
Multiple cycles for the same topic "human cell".
Multiple cycles for the same topic "gene expression".

CONCLUSIONS & FUTURE WORK
We proposed a model for research cycles.
We proposed an efficient algorithm to identify all possible research cycles.
We proposed an adapted hierarchical clustering algorithm to model research cycles in journal publications.
The cycles may not necessarily be true. We need to further develop metrics to validate them.
The clustering method needs theoretical analysis.

THANKS
Questions?

TWO OPEN QUESTIONS
Q1: What drives research cycles?
Generality (more general topics tend to have cycles more frequently)
Data (Social Network)
Q2: What drives shifts of research cycles?
Are there any significant patterns in the shifts? Long cycles, multiple cycles for the same topic, etc.