Slides - Jovian Lin

TopicTrend
Discover Emerging and Novel Research Topics
By: Jovian Lin
Introduction
Formulating a research idea is the first step toward success in
academia.
A worthy research idea must be original and innovative.
In order to come up with innovative research ideas,
researchers have to read a lot of published articles…
… which is time-consuming.
“Is there any shortcut to success?”
“No.”
“There are efficient ways to achieve success”
Introduction
Search Engines in Digital Libraries:
Search engines support information seeking and retrieval.
[Diagram: “Search Query” → Search Engine → Search Results (list of article titles)]
Introduction
How useful is this result to the junior researcher?
Search engines support information seeking and retrieval.
However, is this enough for the junior researcher?
Junior researchers (FYP students, 1st year PhD students) need to:
• Define a research topic (from zero knowledge)
• Survey the literature
• Identify emerging/new research areas to explore
• Determine related topics
Problem Definition
Junior researchers want to:
Understand research topics and trends.
Recognize HOT topics.
Understand how topics interact and influence research activity.
Current inefficient method:
1. Enter a search query
2. View results
3. Select a few articles to read
4. Extract new terms from selected article, then return to step 1
Search results: information overload!
Desired efficient method:
1. Enter a search query
2. View results
TopicTrend: do it quickly!
Results:
• List of HOT research topics (related to the search query)
• Visualization of the research topics
Quick Demo
Evaluation
Recruited 4 participants:
• Chemistry / PhD
• Engineering (Transportation) / PhD
• Comp Science (AI) / PhD
• Engineering / FYP
Participants:
Tested TopicTrend using queries from their respective domains.
Rated TopicTrend’s output (w.r.t. their query). [Quantitative]
Filled out a questionnaire. [Qualitative]
Evaluation
Example query: “machine learning”
Ratings of the topics returned by TopicTrend:
Topic A: 1
Topic B: 0
Topic C: 1
Topic D: 1
Topic E: 1
Topic F: 1
Topic G: 1
Topic H: 1
Topic I: 1
Topic J: 1
Score: 9/10
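The per-query score above is the fraction of the listed topics that the participant rated 1. A minimal sketch of that arithmetic, assuming the ratings are binary and stored in a dictionary (the function name `query_score` is hypothetical):

```python
def query_score(ratings):
    """Return the fraction of topics rated 1 (each rating is 0 or 1)."""
    if not ratings:
        raise ValueError("no ratings given")
    return sum(ratings.values()) / len(ratings)

# Ratings from the "machine learning" example: every topic except B is rated 1.
ratings = {f"Topic {c}": 1 for c in "ABCDEFGHIJ"}
ratings["Topic B"] = 0

print(f"{sum(ratings.values())}/{len(ratings)}")  # 9/10
print(query_score(ratings))                       # 0.9
```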
Quantitative Evaluation
Average score = 68.125%
Qualitative Evaluation
Questionnaire using a five-point Likert scale.
1 = Disagree, 5 = Agree.
Some examples:
“The system was easy to use.” 4.75 / 5
“The system gave interesting results.” 4 / 5
“I was able to get a better understanding of the topics.” 4 / 5
“I was able to discover trends.” 4 / 5
“I was able to discover relationships between topics.” 4 / 5
“I was able to discover potential, novel topics.” 4 / 5
Details in Project Report.
Conclusion
TopicTrend is a visualization tool that helps junior researchers:
Understand research topics and trends.
Recognize HOT topics.
Understand how topics interact and influence research activity.
However, results were mediocre,
due to the presence of stop phrases (e.g., “problem set”, “proposed model”, etc.).
Solutions and Future Work:
TF-IDF weighting: no need to manually enter stop words.
A statistical measure of how important a word is to a document in a corpus.
The importance increases in proportion to the number of times a word appears in the document...
but is offset by the frequency of the word in the corpus.
Latent Dirichlet Allocation (LDA) – view each abstract as a mixture of topics. (David Blei)
Online LDA – find topics faster than normal LDA; analyze in a stream.
Dynamic Topic Models (DTM) – captures the word evolution of each topic over time.
Search by exemplar (instead of search by keyword)
Benefits users who have difficulty expressing their query.
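The TF-IDF weighting proposed above can be illustrated with a small sketch. It assumes documents are already split into candidate phrases, and it uses raw term frequency with the common log(N / df) inverse document frequency, which is only one of several variants. The toy abstracts and the names `tf_idf` and `docs` are made up for illustration. A phrase that occurs in every document, such as a stop phrase, gets weight 0 without a hand-maintained stop list.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of documents, each a list of terms (words or phrases).
    Weight = term count in the document * log(N / number of docs containing it).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (hypothetical): "proposed model" appears in every abstract.
docs = [
    ["proposed model", "topic model", "topic model"],
    ["proposed model", "search engine"],
    ["proposed model", "topic model", "visualization"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy corpus "proposed model" scores 0 in every document, while rarer phrases such as "search engine" keep a positive weight.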
Thank You
Backup Slides
Implementation
OpenNLP: a machine-learning-based toolkit for the processing of natural language text.
Used OpenNLP to retrieve a list of noun phrases (NPs).
[Diagram: An article → OpenNLP Tools → NP A, NP B, NP C, NP D, NP E, NP F]
1. Sentence Detection
2. Tokenization
3. Part-of-Speech (POS) Tagging
4. Chunking and Retrieving NPs
Implementation
Sentence Detection
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr.
Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55
years old and former chairman of Consolidated Gold Fields PLC, was named a
director of this British industrial conglomerate. Those contraction-less sentences don't
have boundary/odd cases...this one does.
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
• Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
• Those contraction-less sentences don't have boundary/odd cases...this one does.
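To make the step concrete, here is a minimal rule-based sketch. It is not OpenNLP's sentence detector, which is a trained model. It simply splits after '.', '!' or '?' when the next word starts with a capital letter, and skips a small hand-picked abbreviation list (`ABBREVIATIONS` and `split_sentences` are assumptions for this example).

```python
import re

# Hand-picked abbreviations that should not end a sentence (illustrative only).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Mr." is followed by a name, not a new sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = (
    "Pierre Vinken, 61 years old, will join the board as a nonexecutive "
    "director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch "
    "publishing group. Rudolph Agnew, 55 years old and former chairman of "
    "Consolidated Gold Fields PLC, was named a director of this British "
    "industrial conglomerate. Those contraction-less sentences don't have "
    "boundary/odd cases...this one does."
)
for sentence in split_sentences(text):
    print(sentence)
```

On this paragraph the sketch yields the same four sentences as listed above. A fixed abbreviation list does not generalize, which is one reason a trained detector is used in practice.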
Implementation
Tokenization
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
• [Pierre] [Vinken] [,] [61] [years] [old] [,] [will] [join] [the] [board] [as] [a] [nonexecutive] [director] [Nov.] [29] [.]
• [Mr.] [Vinken] [is] [chairman] [of] [Elsevier] [N.V.] [,] [the] [Dutch] [publishing] [group] [.]
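A rough regular-expression sketch of this step follows. It is not OpenNLP's tokenizer. It assumes a short hand-written abbreviation list and treats dotted initialisms such as "N.V." as single tokens (the pattern and the name `tokenize` are illustrative).

```python
import re

TOKEN_PATTERN = re.compile(
    r"\b(?:Mr|Mrs|Dr|Nov)\."   # known abbreviations keep their period
    r"|(?:[A-Za-z]\.){2,}"     # dotted initialisms such as N.V.
    r"|\w+"                    # ordinary words and numbers
    r"|[^\w\s]"                # any other single punctuation mark
)

def tokenize(sentence):
    """Return the list of tokens found in a sentence."""
    return TOKEN_PATTERN.findall(sentence)

sentences = [
    "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.",
    "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.",
]
for sentence in sentences:
    print(" ".join(f"[{token}]" for token in tokenize(sentence)))
```

The alternatives are tried left to right, so the abbreviation and initialism rules must come before the generic word rule.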
Implementation
Part-of-Speech Tagging
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
• [NNP] [NNP] [,] [CD] [NNS] [JJ] [,] [MD] [VB] [DT] [NN] [IN] [DT] [JJ] [NN] [NNP] [CD] [.]
• [NNP] [NNP] [VBZ] [NN] [IN] [NNP] [NNP] [,] [DT] [JJ] [NN] [NN] [.]
Implementation
Text Chunking and Extracting NPs
Text chunking divides a text into syntactically correlated groups of words.
Uses the Tokenization and POS Tagging data.
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in
September.
Becomes:
[NP He ] [VP reckons ] [NP the current account deficit ]
[VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ]
[NP September ] .
Note the chunk tags:
• B-Chunk: the first token of a chunk
• I-Chunk: a token inside (continuing) a chunk
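The B-/I- tags mark where each chunk begins and continues, so noun phrases can be read off a tagged sentence by grouping each B-NP token with the I-NP tokens that follow it. A minimal sketch, assuming the chunker output is available as (token, chunk tag) pairs. The tag sequence below is written by hand to match the bracketed example, and `extract_nps` is a hypothetical helper, not part of OpenNLP.

```python
def extract_nps(tagged_tokens):
    """Collect noun phrases from (token, chunk_tag) pairs in B-/I- notation."""
    phrases = []
    current = []
    for token, tag in tagged_tokens:
        if tag == "B-NP":
            if current:
                phrases.append(" ".join(current))
            current = [token]          # a new NP starts here
        elif tag == "I-NP" and current:
            current.append(token)      # continue the open NP
        else:
            if current:
                phrases.append(" ".join(current))
            current = []               # any other tag closes the NP
    if current:
        phrases.append(" ".join(current))
    return phrases

tagged = [
    ("He", "B-NP"), ("reckons", "B-VP"),
    ("the", "B-NP"), ("current", "I-NP"), ("account", "I-NP"), ("deficit", "I-NP"),
    ("will", "B-VP"), ("narrow", "I-VP"), ("to", "B-PP"),
    ("only", "B-NP"), ("#", "I-NP"), ("1.8", "I-NP"), ("billion", "I-NP"),
    ("in", "B-PP"), ("September", "B-NP"), (".", "O"),
]
print(extract_nps(tagged))
# ['He', 'the current account deficit', 'only # 1.8 billion', 'September']
```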
Implementation
An algorithm to calculate the score of an NP.
NPs: NP A, NP B, NP C, NP D, NP E, NP F
NP A:
# (0 ~ 2 years) = 10
# (2 ~ 4 years) = 2
# (4 yrs & beyond) = 1
Score = (10 + 1) / (10 + 2 + 1 + 20) = 11 / 33 = 0.333
NP F:
# (0 ~ 2 years) = 1
# (2 ~ 4 years) = 2
# (4 yrs & beyond) = 10
Score = (1 + 1) / (1 + 2 + 10 + 20) = 2 / 33 = 0.061
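The worked examples suggest the scoring rule sketched below: the count of recent articles (0 to 2 years) plus 1, divided by the total article count plus 20. This is inferred from the numbers on the slide. The meaning of the constants 1 and 20 is not stated in the source, and the names `np_score`, `NUMERATOR_OFFSET` and `DENOMINATOR_OFFSET` are hypothetical.

```python
NUMERATOR_OFFSET = 1      # constant added to the numerator in the slide's examples
DENOMINATOR_OFFSET = 20   # constant added to the denominator in the slide's examples

def np_score(recent, middle, old):
    """Score an NP from its article counts in three age buckets.

    recent: articles 0-2 years old, middle: 2-4 years, old: 4 years and beyond.
    """
    total = recent + middle + old
    return (recent + NUMERATOR_OFFSET) / (total + DENOMINATOR_OFFSET)

counts = {
    "NP A": (10, 2, 1),   # mostly recent articles
    "NP F": (1, 2, 10),   # mostly older articles
}
ranked = sorted(counts, key=lambda name: np_score(*counts[name]), reverse=True)
print(ranked)                                # ['NP A', 'NP F']
print(round(np_score(*counts["NP A"]), 3))   # 0.333
```

Under this rule an NP whose articles are mostly recent outranks one with the same number of articles that are mostly old, which fits the goal of surfacing emerging topics.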
Implementation
Re-rank the list of NPs based on the score.
Before: NP A, NP B, NP C, NP D, NP E, NP F
After re-ranking: NP B, NP D, NP E, NP C, NP A, NP F
Implementation
Calculate the relationship strength between NPs by
considering the common articles (PIIs) that they have.
The more articles they have in common, the thicker the edge.
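The edge weight described above can be sketched as the size of the intersection of two NPs' article sets. The sketch assumes each NP maps to a set of article identifiers (PIIs). The identifiers and the name `edge_weights` are made up for illustration.

```python
from itertools import combinations

def edge_weights(np_articles):
    """Return {(np1, np2): number of shared articles} for pairs sharing at least one."""
    weights = {}
    for a, b in combinations(sorted(np_articles), 2):
        shared = len(np_articles[a] & np_articles[b])
        if shared:
            weights[(a, b)] = shared
    return weights

# Hypothetical article identifiers for three NPs.
np_articles = {
    "NP A": {"pii-1", "pii-2", "pii-3"},
    "NP B": {"pii-2", "pii-3"},
    "NP C": {"pii-3", "pii-9"},
}
print(edge_weights(np_articles))
# {('NP A', 'NP B'): 2, ('NP A', 'NP C'): 1, ('NP B', 'NP C'): 1}
```

A visualization could then map each weight to an edge thickness, so the NP A to NP B edge would be drawn thicker than the other two.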
The End