Publication List Web Page?

Mining Academic Community
Jan-Ming Ho
hoho@iis.sinica.edu.tw
Computer System and Communication Lab
Institute of Information Science
Academia Sinica
What is Community?
• In graph theory
  • Densely connected groups of vertices, with sparser connections between groups
• In social network analysis
  • Groups of entities that share similar properties or connect to each other via certain relations
• A social network is a structure made up of nodes, representing entities from different conceptual groups, that are linked by different types of relations
Why is Community Important?
• Interesting data with community structure
  • Researcher collaboration, friendship networks, the WWW, massively multiplayer online games, electronic communications
• Groups of web pages that link to more pages inside the community than outside it correspond to web pages on related topics
• Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.
Motivation
• Understand the research network between authors, conferences and topics (rank entities by relevance to given entities)
• Find and justifiably recommend research collaborators for given authors
• Explore the academic social network
• Find the most important papers, researchers and venues for a given topic
Related Systems
• Many digital library systems exist
  • ACM Digital Library
  • IEEE Xplore
  • DBLP
  • CiteSeer
  • Libra
  • DBconnect
• Problems
  • The coverage of the datasets is not large enough
  • The name ambiguity problem exists in
    • Web pages
    • Citation records
Libra Academic Search
• http://libra.msra.cn
• Free computer science bibliography search engine
• A test-bed for object-level vertical search research
• Currently the following types of paper-related objects can be searched:
  • Papers, Authors, Conferences, Journals, Research Communities
DBconnect: Conference
DBconnect: Topic
DBconnect: Author
ZoomInfo
(1) People Directory
(2) Developer Tools
(3) Social Network, Profile Statistics, Employment History
(4) Can it resolve ambiguous names? E.g., a search returns 21 different people called “Bing Liu”
ArnetMiner
Our goal
• Develop an automatic system to
  • Explore the academic social network
  • Find the most important papers, researchers and venues for a given topic
• Provide solutions for existing problems
  • Collecting larger citation datasets
    • Retrieving data from web pages
      • Publication list finder
      • Extracting citation strings from web pages
      • Citation parser
    • Multilingual data sources
      • Chinese and English corpora
  • Name disambiguation mechanism in
    • Web pages
    • Citation records
Our contributions
• Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif," in the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, Dec. 18-22, 2006
• Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho, "PLF: A Publication List Web Page Finder for Researchers," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007
• Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and Jan-Ming Ho, "Mining Translations of Chinese Name from Web Corpora by Using Query Expansion Technique and Support Vector Machine," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007
• Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming Lee, "AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis," in Proceedings of the 12th Conference on Artificial Intelligence and Applications (TAAI2007), Nov. 16-17, 2007
• Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," to appear in Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA-08)
PLF: A Publication List Web Page
Finder for Researchers
Agenda
• Introduction
• Publication List Web Page Finder, PLF
• Performance Evaluation
• Conclusion, Future Work
Overview of a Publication List Web Page
• Helps readers keep abreast of state-of-the-art research
  • Contains citations not found elsewhere.
  • May provide reference materials, such as slides and talks.
• Challenges
  • How to find the publication list web pages
    • Given only the researcher's name.
  • Various versions or multiple copies
    • An author may have many affiliations.
  • Name ambiguity problem
    • E.g., for Dr. Bing Liu, we found 26 people sharing the same name by querying ZoomInfo (a people search engine).
Problem
“Publication List Web Page?”
Definition of Publication List
• Affiliated Personal Publication List Web Page (APPL): a web page that belongs to the affiliated web site of a specific person with the given name.
[Figure: an example APPL, annotated with its affiliation (Institute of Information Science, Academia Sinica) and a citation string]
Process Flow
• Citation String Search: collects the citation strings from digital libraries
• Web Page Crawler: collects the hyperlinks of web pages from search engines by using the collected citation strings as queries
• Rank Function: analyses the statistics of all the collected hyperlinks of web pages
[Figure labels: Given Names, Query, Digital Libraries, Parsing, Citations, Search Engines, hyperlinks, web page, statistics, Analyse, Interaction]
Basic Concept
• A publication list web page may contain many citation strings
[Figure: for an author (e.g., Jan-Ming Ho), each paper title PT1 ... PTm is sent as a query (QPT1, QPT2, ...) to a search engine, and each query returns web pages WP1 ... WPn; a publication list web page is one that contains many of the paper titles]
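The idea above can be sketched as a simple voting scheme: a page that is returned for many different paper-title queries is a likely publication list page. This is only a minimal illustration, not the PLF rank function itself; the function name `rank_pages`, the example URLs and the use of a plain hit count as the ranking statistic are all assumptions.

```python
from collections import Counter

def rank_pages(results_per_query, top_k=5):
    """Rank URLs by how many distinct citation-string queries returned them.

    results_per_query maps each query (e.g. a paper title) to the list of
    URLs a search engine returned for it. The hit count is an assumed
    stand-in for the statistics analysed by the real rank function.
    """
    hits = Counter()
    for urls in results_per_query.values():
        for url in set(urls):  # count each URL at most once per query
            hits[url] += 1
    # Sort by hit count (descending), then by URL for a deterministic order
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_k]]

# Hypothetical search results for three paper-title queries
results = {
    "paper title 1": ["http://a.example/pubs", "http://b.example/paper1"],
    "paper title 2": ["http://a.example/pubs", "http://c.example/paper2"],
    "paper title 3": ["http://a.example/pubs", "http://b.example/paper1"],
}
top = rank_pages(results)
```

Here `http://a.example/pubs` is returned by all three queries and therefore ranks first.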
Dataset
• Scenario
  • Seminar members have usually published major research works
  • We randomly collected 200 names from the WWW ’06 Conference Committee website

APPL type      #APPL   #people   %population
others         0       22        11%
single-group   1       120       60%
multi-group    2       35        17.5%
multi-group    3       16        8%
multi-group    4       7         3.5%
Experiment Evaluation
• Evaluation metrics
  • We consider the top-5 results derived by each link and focus on the top-5 recall metric, which is calculated by:

$\text{recall} = \frac{R_a}{R}$

Notation   Definition
R_a        the number of publication list web pages contained in the top-5 results
R          the number of publication list web pages belonging to researchers listed in the dataset
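As a small worked illustration of the metric, the sketch below takes R_a as the number of true publication list pages found in the top-5 results and R as the total number of true publication list pages in the dataset. The function name and the toy data are hypothetical.

```python
def top_k_recall(ranked_by_person, relevant_by_person, k=5):
    """Top-k recall over a dataset of researchers.

    ranked_by_person:   name -> ranked list of returned URLs
    relevant_by_person: name -> list of that person's true publication list URLs
    """
    found = 0  # R_a: true pages that appear in the top-k results
    total = 0  # R: all true pages in the dataset
    for person, relevant in relevant_by_person.items():
        top = ranked_by_person.get(person, [])[:k]
        found += len(set(top) & set(relevant))
        total += len(relevant)
    return found / total if total else 0.0

# Hypothetical results for two researchers
ranked = {"A": ["u1", "u2", "u3", "u4", "u5", "u6"], "B": ["v1", "v2"]}
relevant = {"A": ["u2", "u6"], "B": ["v9"]}
recall = top_k_recall(ranked, relevant)
```

In this toy case only `u2` is found inside the top 5, out of three true pages, so the recall is one third.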
Parameter Analysis for Single-Group
[Figure (a): fixed n with varying m. Figure (b): fixed m with varying n. Curves are labelled by (m, n).]
• Figure (a)
  • When m increases, the recall rate also increases.
• Figure (b)
  • System performance may be constrained by m.
Parameter Analysis for Multi-Group
[Figure (a): fixed n with varying m. Figure (b): fixed m with varying n.]
• Figure (a)
  • It is clear that the performance when m = 40 is always better than with the other settings.
• Figure (b)
  • The best performance (top-5 recall is 70%) occurs when n = 75.
Performance Evaluations
[Figure (a): performance of the approaches on the single-group dataset. Figure (b): performance of the approaches on the multi-group dataset. Legend label: (given name + keyword).]
1. The parameter m has a strong influence on the system’s performance; for example, an oversized m may degrade the performance.
2. The parameter n has little influence on the system’s performance.
3. The PLF system outperforms the other two approaches on both the single-group and the multi-group datasets.
Conclusion
• We have defined the problem of finding the publication list web pages of a researcher, and proposed the “PLF” system
• Ongoing work
  • Name ambiguity problem
  • How to merge the multiple publication list web pages for a specific person into a single page.
Discussion – Name Ambiguity Problem
• Scenario
  • We take the name “Bing Liu” as an example
  • Analyzed manually
• Observation
  • Citation count
  • Name translation problem
  • Partial matching problem
Extracting Citation Strings from Web
Pages
Extract Citation Records
[Figure: Web Page → Extract → Structured Data]
Challenges
• The formats of publication list web pages vary
• There are no fixed syntactic rules for parsing citation records
  • Hence, we cannot apply simple rules to extract citation records automatically
Challenges:
Complex Layouts of Publication List Pages
Ideas
• The semantic structure of web pages is organized by visual arrangement.
• We can utilize the semi-structured (visual) information of web pages to help the extraction task.
• With its hierarchical structure and geometric information, the DOM tree is not only a good structure for representing web pages, but is also very helpful for visual pattern analysis.
DOM Tree Presentation of Web page
[Figure: the DOM tree of a web page, with a Banner node, a Navigator Bar node, and a Publication List node whose children are Citation String nodes]
Architecture of Citation Extraction System
[Figure: the Publication List Web Page Finder supplies publication list pages, which are parsed into a DOM tree (nodes such as p, div, li, a, em, with text nodes T1 to T6). Inside the Citation Extraction System, the Common Style Finder mines common style patterns, and the Citation Extractor extracts candidate records, estimates citation length with a Normal Citation Model, and ranks the candidate citation records to output citation records. The figure also shows CiteSeer.]
Modules of Citation Extraction System
• Common Style Finder
  • Finds all common style patterns for each level of granularity in web pages
• Citation Extractor
  • Explores data regions with common style patterns
  • Distills extraction rules from those data regions
  • Ranks extraction patterns based on a normal word count distribution probability
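One way to picture a "common style pattern" is as a tag path in the DOM tree that recurs for many text nodes. The sketch below uses Python's standard `html.parser` to collect the tag path of every text node and keeps the paths that repeat. It is an illustrative simplification, not the module described above; the class name, the `min_count` threshold and the sample page are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    """Record the tag path (e.g. 'div/ul/li/a') of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths.append("/".join(self.stack))

def common_style_patterns(html, min_count=2):
    """Return the tag paths shared by at least min_count text nodes."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(collector.paths)
    return {path: n for path, n in counts.items() if n >= min_count}

# Hypothetical publication list page
page = """
<div><p>Publications</p>
<ul>
<li><a href="#">Paper A</a> <em>Conf X</em></li>
<li><a href="#">Paper B</a> <em>Conf Y</em></li>
<li><a href="#">Paper C</a> <em>Conf Z</em></li>
</ul></div>
"""
patterns = common_style_patterns(page)
```

On this page the repeated paths are `div/ul/li/a` and `div/ul/li/em`, which together mark the region holding the citation records, while the one-off `div/p` heading is dropped.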
BibPro: A Citation Parser based on
Sequence Alignment Techniques
System Goal
Citation:
Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.
→ Our System →
Metadata:
Author: Chomsky, Noam
Title: Three models for the description of language
Journal: IRE Transactions on Information Theory
Volume: 2
Issue: 3
Page: 113-124
Month:
Year: 1956
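Basic Idea (1/2)
• Encode a citation as a protein sequence
• Keep only the citation style information
  • Order of fields
  • Field separators
[Figure: each part of the citation maps to one amino acid symbol, e.g. Author → A, Title → T, Journal → L, with separators such as D between fields, followed by year and page symbols (… D Y R P H S)]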
Basic Idea (2/2)
• Determine the citation style from the order of punctuation marks and reserved words
[Figure: in system preprocessing, each known citation is stored with its feature index and style; in online parsing, the feature index of a citation string is given to the BLAST search tool, which searches the stored (citation feature index, style) entries]
How to encode a citation as a protein sequence?
• Keep the citation style information
  • Which fields should be included? (only 23 symbols can be used)
  • Which punctuation marks are used to separate fields?
• By observing different citation styles, we define an encoding table that translates each token of a citation into an amino acid symbol
Encode Table
A: Author
T: Title
L: Journal
F: Volume value
W: Issue value
H: Page value
M: Month
Y: Year
X: noise (unrecognized token)
S: Issue key. e.g. “no”, “No”
P: Page key. e.g. “pp”, “page”
V: Volume key. e.g. “Vol”, “vo”
N: numeral
Q: @ # $ % ^ & * + = \ | ~ _ / ! ? 。
I: ( [ { < 「
K: ) ] } > 」
D: .
G: " “ ”
R: ,
C: - :
E: ' `
Z: ;
B: blank
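To make the table concrete, the sketch below encodes the punctuation, reserved words and numbers of a raw citation string with the symbols listed above. It is an assumption-laden illustration rather than BibPro's encoder: before parsing, the field symbols (A, T, L, F, W, H, M) are unknown, so every other word is mapped to X, any four-digit number between 1000 and 2999 is guessed to be a year, and only the example keywords given in the table are recognised.

```python
import re

# Punctuation classes copied from the encode table above
PUNCT = {}
for symbol, chars in {
    "Q": "@#$%^&*+=\\|~_/!?。",
    "I": "([{<「",
    "K": ")]}>」",
    "D": ".",
    "G": "\"“”",
    "R": ",",
    "C": "-:",
    "E": "'`",
    "Z": ";",
}.items():
    for ch in chars:
        PUNCT[ch] = symbol

# Reserved words (lower-cased); only the examples given in the table
KEYWORDS = {"no": "S", "pp": "P", "page": "P", "vol": "V", "vo": "V"}

# A token is a run of blanks, a run of letters/digits, or one other character
TOKEN = re.compile(r"\s+|[^\W_]+|\S")

def encode(citation):
    """Translate a citation string into a sequence of style symbols."""
    symbols = []
    for token in TOKEN.findall(citation):
        if token.isspace():
            symbols.append("B")
        elif token in PUNCT:
            symbols.append(PUNCT[token])
        elif token.lower() in KEYWORDS:
            symbols.append(KEYWORDS[token.lower()])
        elif token.isdecimal():
            # Assumption: a four-digit number in 1000..2999 is a year
            is_year = len(token) == 4 and 1000 <= int(token) <= 2999
            symbols.append("Y" if is_year else "N")
        else:
            symbols.append("X")  # unrecognized token
    return "".join(symbols)

encoded = encode("Vol. 2, no. 3, pp. 113-124, 1956.")
```

The resulting string keeps only the order of separators, keywords and numbers, which is the style information that the later alignment step relies on.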
How to use protein sequences to extract metadata?
• Transform the extraction problem into a sequence alignment problem
• Form translation
  • Unknown answer
    • BASE FORM
    • ALIGN FORM
    • INDEX FORM
  • Known answer
    • RESULT FORM
    • STYLE FORM
    • INDEX FORM
RESULT FORM (Known Answer)
BASE FORM (Unknown Answer)
System Structure
• System PreProcess (Template Generating System)
  • Citation Crawler
  • Template Builder
• Online Parsing (Parsing System)
  • Template Matching
  • Metadata Extraction
[Figure: (1) System PreProcess builds the Template Database from resources on the Internet; (2) Online Parsing takes a query citation, consults the Template Database, and outputs metadata]
Citation Crawler
[Figure: the Citation Crawler queries the Google, IEEE, ACM and CiteSeer engines and collects citations together with their BibTex records; a BibTex Parser turns each BibTex record into metadata]
BLAST-powered Template Matching
[Figure: the query citation is encoded with the Encode Table and translated (Form Translation) into its INDEX FORM; BLAST searches the TEMPLATE DATABASE of (INDEX FORM, STYLE FORM) pairs and returns the matching STYLE FORMs]
Evaluation for CiteSeer DataSet
• Consider the inconsistency between the citation string and the BibTex file (metadata)
• Old measurement:

$\text{Field Precision}_{old} = \frac{\#[\text{Token}_{\text{parsed field}} \in \text{Token}_{\text{BibTex field}}]}{\#[\text{Token}_{\text{parsed field}} \in \text{Token}_{\text{BibTex}}]}$

• New measurement:

$\text{Field Precision}_{new} = \frac{\#[\text{Token}_{\text{parsed field}} \in \text{Token}_{\text{BibTex field}}]}{\#[\text{Token}_{\text{query citation}} \cap \text{Token}_{\text{BibTex}}]}$
Definition
• $\text{Token}_{\text{parsed field}}$: the tokens that appear in the parsed subfield
• $\text{Token}_{\text{query citation}}$: the tokens that appear in the query citation string
• $\text{Token}_{\text{BibTex field}}$: the tokens that appear in the specific subfield in the BibTex file
• $\text{Token}_{\text{BibTex}}$: all tokens that appear in the BibTex file
These tokens do not include punctuation.
Compare with ParaCite
• DataSet
  • Collected from CiteSeer
  • Training set: 2416
  • Testing set: 4131
• ParaCite
  • Uses its default template database
    • Adding templates to its database is not easy
  • Tested on the testing set
• Our System
  • Uses a template database built from the training set
  • Tested on the testing set
Experimental Results
System     Eval   Author   Title    Journal  Volume   Page      Issue    Month    Year     Score
ParaCite   new    32.90%   73.35%   29.83%   -        4.58%     25.05%   -        77.04%   50.22%
ParaCite   old    99.08%   62.72%   30.46%   -        100.00%   93.96%   -        99.70%   78.81%
Ours       new    93.73%   73.32%   51.34%   83.52%   94.62%    85.11%   89.18%   96.49%   84.80%
Ours       old    90.58%   89.51%   67.66%   93.58%   96.69%    91.79%   99.49%   99.50%   91.45%
(Eval: new or old measurement. "-": no Volume or Month column is reported for ParaCite.)
Analysis
• ParaCite can extract only one author name
• The old evaluation has a problem: a system is likely to obtain high accuracy if it extracts less information
Evaluation for clean DataSet
• The citation string is fully composed of the corresponding metadata

$\text{Accuracy} = \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}}$
Compare with INFOMAP
• DataSet
  • Includes 160000 records
  • Training dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
  • Testing dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
Result
Style     Author   Title    Journal  Volume   Page     Issue    Year     Overall average
APA       99.67%   96.38%   97.06%   98.99%   98.71%   98.12%   99.42%   98.33%
IEEE      98.72%   98.12%   99.12%   99.30%   98.40%   98.39%   99.40%   98.78%
ACM       97.14%   95.01%   93.93%   97.19%   97.92%   97.03%   98.88%   96.73%
ISR       99.48%   96.17%   96.96%   99.15%   98.55%   98.39%   99.35%   98.29%
MISQ      98.59%   97.99%   98.98%   99.41%   98.83%   98.61%   99.54%   98.85%
JMIS      91.95%   87.90%   90.46%   99.23%   98.76%   98.03%   99.46%   95.11%
Average   97.59%   95.26%   96.09%   98.88%   98.53%   98.09%   99.34%   97.68%
Evaluation for Cora DataSet
• 500 records
• Used as a benchmark in many papers (HMM, SVM, CRF)
Evaluation
• Divide words into four kinds:
  • TP, FP, TN, FN
• Four metrics:
  • Word Accuracy: (TP+TN)/(TP+FP+FN+TN)
  • Precision: TP/(TP+FP)
  • Recall: TP/(TP+FN)
  • F1-measure: (2*Precision*Recall)/(Precision+Recall)
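The four metrics can be computed per field from word-level labels. The sketch below is a minimal illustration; the function name and the toy label sequences are hypothetical, and a word counts as positive for a field when it is labelled with that field.

```python
def word_metrics(predicted, gold, field):
    """Word-level accuracy, precision, recall and F1 for one field.

    predicted and gold are equal-length lists holding the field label
    assigned to each word of a citation.
    """
    tp = fp = tn = fn = 0
    for p, g in zip(predicted, gold):
        if p == field and g == field:
            tp += 1
        elif p == field:
            fp += 1
        elif g == field:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for a six-word citation
predicted = ["author", "author", "title", "title", "title", "year"]
gold = ["author", "title", "title", "title", "journal", "year"]
metrics = word_metrics(predicted, gold, "title")
```

For the field "title" this gives TP = 2, FP = 1, FN = 1 and TN = 2, so accuracy, precision, recall and F1 all equal 2/3.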
Our System
Field     Accuracy   F1
Author    97.17%     93.98%
Title     94.17%     90.13%
Journal   93.58%     83.27%
Volume    99.21%     84.62%
Page      99.21%     92.09%
Date      99.92%     98.96%
Mining Translations of Chinese Names from
Web Corpora by Using a Query Expansion
Technique and Support Vector Machine
Agenda
• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work
Background
• Most academic information can be found on the Web
  • Google Scholar, DBLP, etc.
Problems in Searching Chinese Names
[Figure label: Only Chinese Corpus]
Challenges in Chinese Name Translation
• Many pronunciation rules in different areas
  • 陳 → Chen (Taiwan)
  • 陳 → Tsun (Hong Kong)
  • 陳 → Tan (Fukien)
• Some additional words exist.
  • Ex: 黃光明 (Kwang-Ming Frank Hwang)
  • Ex: 張韻詩 (Jane Win-Shih Liu)
Common Chinese Name Translation Format
Name format: Examples
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name): 劉豐哲 (Fon-Che Liu), 黃田漢 (Ng Tian Hann), 林牛 (Ngau Lam)
Type-2. (Merged Chinese given name) (Surname): 吳德琪 (Derchyi Wu)
Type-3. (Western first name) (Surname): 趙蓮菊 (Anne Chao)
Type-4. (Chinese given name) (Western first name) (Surname): 黃光明 (Kwang-Ming Frank Hwang)
Type-5. (Abbreviated Chinese given name) (Surname): 張秀瑜 (S.-Y. Chang)
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname): 李昭勝 (Jack-C. Lee)
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname): 蔡桂紅 (Gwei-Hung H. Tsai)
Type-8. (Chinese given name) (Unpredictable Surname): 張韻詩 (Jane Win-Shih Liu)
Goal
• Design an automatic mechanism to translate a given Chinese name into its related English name
Concepts of Proposed Approach
[Figure label: No corresponding translations]
Three Major Techniques
• Query expansion technique
  → Translation of the surname
  • Obtains the related Web page snippets of the Chinese name translation.
  • Solves the problem of unrelated terms existing in the name translation.
• Knowledge-based method
  → Chinese surname database, a common dictionary, Western first name database
  • Obtains all the name-like terms from the returned Web page snippets.
• SVM
  → Chinese pronunciation database, the phonetic feature and the distant feature, selected training samples
  • Selects the appropriate Chinese name translations from the candidates.
System Architecture
[Figure: Chinese names → Query expander (Chinese surname database) → Returned Web page snippets → Candidate extractor (Western first name database, On-line dictionary) → Name candidates → SVM-based name selector (Chinese pronunciation database) → Translated English names]
Query Expander
• Goal: to retrieve Web page snippets that contain both a person’s Chinese name and the translation of the person’s surname.
• Name splitter
  • Determines whether the input Chinese name contains a compound surname (→ Chinese surname database)
  • Divides the input Chinese name into a “surname” part and a “given name” part.
• Surname translator
  • Selects appropriate surname translations (→ Chinese surname database)
  • The strength of the relationship between each surname translation and the person is determined by the “distance from the person’s Chinese name to the surname’s translation”.
• Web page retriever
  • Makes the concept of the query word clearer.
  • Retrieves the related Web pages.
  • The new query word will be “(Chinese name) + (Surname’s translation)”.
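The name splitter and the expanded query can be sketched as follows. The miniature surname database is hypothetical (the romanizations of 陳 and 歐陽 are taken from examples in these slides, the entry for 歐 is invented for illustration), and this sketch emits a query for every listed translation instead of selecting the best ones by distance as the surname translator does.

```python
# Hypothetical miniature Chinese surname database: surname -> romanizations
SURNAME_DB = {
    "陳": ["Chen", "Tsun", "Tan"],
    "歐陽": ["Ouhyang"],
    "歐": ["Ou"],  # invented entry, for illustration only
}

def split_name(chinese_name, surname_db=SURNAME_DB):
    """Split a Chinese name into (surname, given name).

    A two-character compound surname is preferred over a one-character one.
    """
    for length in (2, 1):
        surname = chinese_name[:length]
        if surname in surname_db and len(chinese_name) > length:
            return surname, chinese_name[length:]
    raise ValueError(f"unknown surname in {chinese_name!r}")

def expanded_queries(chinese_name, surname_db=SURNAME_DB):
    """Build "(Chinese name) + (Surname's translation)" query strings.

    Joining the two parts with a space is an assumption about how the
    search engine query is formed.
    """
    surname, _ = split_name(chinese_name, surname_db)
    return [f"{chinese_name} {translation}"
            for translation in surname_db[surname]]

queries = expanded_queries("陳威達")
```

For 陳威達 this yields one query per candidate romanization of 陳, while 歐陽明 is split on the compound surname 歐陽 rather than on 歐.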
Distance between Two Terms
• Calculation of the “distance between two terms”:

$D = N$

where D is the distance and N is the number of non-words between the two terms.
• Example: 陳威達(Wei-Da Chen)
  • The distance from the person’s Chinese name (陳威達) to the surname’s translation (Chen) is 3.
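A minimal sketch of this distance follows, under the assumption that "non-words" means non-word characters (punctuation and blanks) between the two terms. That reading reproduces the value 3 for the example when the snippet is written as 陳威達(Wei-Da Chen), but it is an interpretation rather than the authors' stated definition, and the function name is hypothetical.

```python
import re

def term_distance(text, source, target):
    """Count non-word characters between source and the next target.

    Only the case where target follows source is handled; None is
    returned when either term is missing.
    """
    start = text.find(source)
    if start < 0:
        return None
    begin = start + len(source)
    end = text.find(target, begin)
    if end < 0:
        return None
    between = text[begin:end]
    # \w matches letters, digits and CJK characters, so [^\w] keeps
    # only punctuation and whitespace
    return len(re.findall(r"[^\w]", between))

distance = term_distance("陳威達(Wei-Da Chen)", "陳威達", "Chen")
```

Between 陳威達 and Chen the text is "(Wei-Da ", whose non-word characters are the parenthesis, the hyphen and the blank.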
Candidate Extractor
• Goal: to extract possible candidates from the retrieved Web page snippets.
• Steps:
  1. Remove all HTML tags.
  2. Identify the positions of all the Chinese surnames existing in the snippets (→ Chinese surname database).
  3. Extract any English term near each surname in the snippets if the term has one of the following properties:
     – The term cannot be found in a common dictionary.
     – The term is a Western first name.
     – The length of the term is 1.
※ At most three English terms in the neighborhood of the surname will be extracted.
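The three steps can be sketched as below. The word lists are tiny hypothetical stand-ins for the common dictionary, the Western first name database and the Chinese surname database, and "near" is approximated by a fixed window of characters after the surname, which is an assumption.

```python
import re

# Hypothetical stand-ins for the knowledge bases used by the extractor
COMMON_WORDS = {"professor", "department", "of", "computer", "science", "the"}
WESTERN_FIRST_NAMES = {"frank", "jane", "anne"}
SURNAMES = ["黃", "張", "陳"]

def is_name_like(term):
    """A term qualifies if it is one letter long, a Western first name,
    or absent from the common dictionary."""
    lowered = term.lower()
    return (len(term) == 1
            or lowered in WESTERN_FIRST_NAMES
            or lowered not in COMMON_WORDS)

def extract_candidates(snippet, window=40, max_terms=3):
    """Return up to max_terms name-like English terms after each surname."""
    text = re.sub(r"<[^>]+>", " ", snippet)  # step 1: remove HTML tags
    # Longer surnames first so compound surnames would win the alternation
    ordered = sorted(SURNAMES, key=len, reverse=True)
    surname_re = re.compile("|".join(map(re.escape, ordered)))
    candidates = []
    for match in surname_re.finditer(text):  # step 2: surname positions
        nearby = text[match.end():match.end() + window]
        terms = [t for t in re.findall(r"[A-Za-z]+", nearby)
                 if is_name_like(t)]  # step 3: name-like English terms
        candidates.append(terms[:max_terms])
    return candidates

snippet = "<b>黃光明</b> (Kwang-Ming Frank Hwang), Professor of Computer Science"
found = extract_candidates(snippet)
```

For this snippet the terms Kwang, Ming and Frank are kept: "Professor" is rejected because it is a dictionary word, and "Hwang" is cut off by the three-term limit.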
System Architecture 4/10 - Candidate extractor
• Step 1: Identifying the positions of all the Chinese surnames existing in the snippets.
• Step 2: Extracting any English terms near each surname in the snippets if the term has one of the following properties:
  • The term cannot be found in a common dictionary.
  • The term is a Western first name.
  • The length of the term is 1.
• The extracted terms will be the name translation candidates and will be sent to the SVM-based name selector for processing.
SVM-based Name Selector
• Goal: to extract each candidate’s features and use them to determine whether the candidate is the correct translation of the input Chinese name.
• Features:
  1. The phonetic feature:
     – Phonetic similarity (→ Soundex algorithm)
  2. The distant feature:
     – Smallest distance (between the Chinese name and the translation candidates)
     – Number of appearances in the neighborhood
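For reference, the standard American Soundex code can be computed as below. How the selector turns Soundex codes into a phonetic similarity feature is not specified here, so comparing two codes for equality, as in the usage lines, is only one assumed possibility.

```python
# Standard Soundex digit classes
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    previous = CODES.get(first, "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            encoded.append(code)
        if c not in "HW":  # H and W do not separate letters with equal codes
            previous = code
    # Pad with zeros and keep the first letter plus three digits
    return (first + "".join(encoded) + "000")[:4]

same_sound = soundex("Chen") == soundex("Chan")       # both are "C500"
different_sound = soundex("Chen") != soundex("Tan")   # "C500" vs "T500"
```

Names that sound alike tend to share a code, which is why such codes are useful when one Chinese character has several romanizations.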
Distant Features
• The “neighborhood”:
  • The close area around each occurrence of the Chinese name.
  • The close area is defined by a given threshold on the distance, measured in number of words.
[Figure: for the candidate “win-shih”, the smallest distance is 2 and the number of appearances in the neighborhood is 2]
Summary
• Query expansion technique
  • Retrieving related Web pages.
• Knowledge-based method
  • Extracting appropriate name translation candidates from the retrieved Web pages.
• SVM
  • Learning the verification rule and
  • Selecting appropriate name translations from the extracted candidates.
Testing Environment and Dataset 1/3
• The following tools are used:
  • Cambridge on-line dictionary
  • Google search engine
  • LIBSVM
• Two datasets are used:
  • Dataset I (training & testing):
    • Collected from the Directory of scholars of the Institute of Mathematics.
    • Contains 78 records.
  • Dataset II (testing):
    • Collected by our program from the Website of the Directory of the Division of Computer Science of the National Science Council.
    • Contains 1,157 records; the name translations of 40 of them cannot be found in Google.
Testing Environment and Dataset 2/3
Name format: Dataset I (#, %); Dataset II (#, %); examples
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name): 19, 24.3%; 1000, 89.5%; 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)
Type-2. (Merged Chinese given name) (Surname): 10, 12.8%; 42, 3.8%; 蔡丕裕 (Piyu Tsai)
Type-3. (Western first name) (Surname): 9, 11.5%; 9, 0.8%; 賴友仁 (Eugene Lai)
Type-4. (Chinese given name) (Western first name) (Surname): 14, 17.9%; 50, 4.5%; 劉立頌 (Alan Li-Sung Liu), 陳嘉懿 (Jia-Yih Joy Chen), 楊豐瑞 (Fongray Frank Young)
Type-5. (Abbreviated Chinese given name) (Surname): 3, 3.8%; 0, 0%; 洪英超 (I.-C. Hung)
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname): 8, 10.3%; 9, 0.8%; 曾秋蓉 (Judy C. R. Tseng)
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname): 3, 3.8%; 3, 0.4%; 黃哲志 (Tetz C. Huang)
Type-8. (Chinese given name) (Unpredictable Surname): 12, 15.4%; 4, 0.4%; 張肇健 (Trieu-Kien Truong)
Testing Environment and Dataset 3/3
• The alignment accuracy
  • Proposed by Huang (2005).
  • The probability of selecting the correct answers when the searched snippets contain the correct answers.

$AA_i = \frac{N_{cc}}{N_d}$

where
  • $AA_i$: the alignment accuracy of candidate i.
  • $N_d$: the number of testing data.
  • $N_{cc}$: the number of correct translations.
• Performance measurement: Top-1 to Top-5 alignment accuracy.
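A top-k version of the alignment accuracy can be sketched as the fraction of test names whose correct translation appears among the first k ranked candidates. The function name is hypothetical; the correct translations are examples from these slides, while the wrong candidates are invented for illustration.

```python
def top_k_accuracy(ranked_candidates, correct, k):
    """Fraction of test names whose correct translation is in the top k.

    ranked_candidates: Chinese name -> ranked list of candidate translations
    correct:           Chinese name -> correct translation
    """
    if not ranked_candidates:
        return 0.0
    hits = sum(1 for name, candidates in ranked_candidates.items()
               if correct[name] in candidates[:k])  # N_cc
    return hits / len(ranked_candidates)            # N_cc / N_d

# Hypothetical selector output for three test names
ranked = {
    "丁建文": ["Jen-Wen Ding", "Ding Jen"],
    "蔡丕裕": ["Tsai Pi", "Piyu Tsai"],
    "賴友仁": ["Lai Yu"],
}
correct = {"丁建文": "Jen-Wen Ding", "蔡丕裕": "Piyu Tsai", "賴友仁": "Eugene Lai"}
top1 = top_k_accuracy(ranked, correct, k=1)
top2 = top_k_accuracy(ranked, correct, k=2)
```

Raising k from 1 to 2 picks up the second name, whose correct translation is ranked second, so the accuracy goes from 1/3 to 2/3.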
Results and Analysis 1/3
- Overall performance on Dataset I
70.5% top-1 accuracy
91% top-5 accuracy
Results and Analysis 2/3
- Overall performance on Dataset II
57.9% top-1 accuracy
86.2% top-5 accuracy
Results and Analysis 3/3 - Performance of each name type
Name format: Example
Type-1: 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)
Type-2: 蔡丕裕 (Piyu Tsai)
Type-3: 賴友仁 (Eugene Lai)
Type-4: 劉立頌 (Alan Li-Sung Liu), 陳嘉懿 (Jia-Yih Joy Chen)
Type-5: 洪英超 (I.-C. Hung)
Type-6: 曾秋蓉 (Judy C. R. Tseng)
Type-7: 黃哲志 (Tetz C. Huang)
Type-8: 張肇健 (Trieu-Kien Truong)
• Our system performs better on Type-1, Type-2, Type-4 and Type-6.
Discussions
• Major reason for the low performance on Type-3, Type-5, Type-7 and Type-8
  • The lack of Web information.
• Usually more than one correct name translation is found for an input Chinese name.
  • The name ambiguity problem.
Limitations
• Uncommon surnames
• Reliance on Web resources
• Search engine selection
• No name disambiguation
Conclusions
• Mining information from Web corpora is effective for dealing with the person name translation problem
• The name ambiguity problem arises frequently
Thank You
Jan-Ming Ho
hoho@iis.sinica.edu.tw
Institute of Information Science
Academia Sinica