Wikification

CSE 6339 (Section 002)
Abhijit Tendulkar
Wikify! Linking Documents to Encyclopedic Knowledge. R. Mihalcea and A. Csomai.
Learning to Link with Wikipedia. D. Milne and I. H. Witten.
What is Wikification
• Automatic keyword extraction
• Word sense disambiguation
• Automatic cross-referencing of documents (unstructured text) with Wikipedia
Wikify! - Introduction
• Introduces annotation of documents by linking them to Wikipedia
• Applications include the semantic web, educational tools, and a number of other text-processing problems
• Previous similar systems (Microsoft Smart Tags, Google AutoLink) were based merely on word or phrase lookup, with no keyword extraction or disambiguation
Wikify! - Text Wikification
Wikify! - Keyword Extraction
• Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select a proper amount of keywords.
• Unsupervised algorithms involve two steps:
– Candidate extraction: extract all possible n-grams.
– Keyword ranking: assign a numeric score to each candidate, using one of three methods: tf-idf, χ² independence testing, or keyphraseness (see the sketch below).
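A minimal sketch of the keyphraseness ranking step, assuming hypothetical `link_counts` and `occurrence_counts` tables precomputed from a Wikipedia dump (an illustration, not the authors' code):

```python
def keyphraseness(phrase, link_counts, occurrence_counts):
    # Fraction of the phrase's Wikipedia occurrences that editors chose
    # to mark up as a link; both tables are hypothetical dicts built
    # offline from a Wikipedia dump.
    occurrences = occurrence_counts.get(phrase, 0)
    if occurrences == 0:
        return 0.0
    return link_counts.get(phrase, 0) / occurrences

def rank_candidates(ngrams, link_counts, occurrence_counts, keep=10):
    # Score every candidate n-gram and keep the highest-ranked ones.
    ranked = sorted(ngrams, reverse=True,
                    key=lambda g: keyphraseness(g, link_counts, occurrence_counts))
    return ranked[:keep]
```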
Wikify! - Evaluation of Keyword Extraction
Wikify! - Word Sense Disambiguation
• Ambiguity is inherent to human language
• Disambiguation algorithms:
– Knowledge-based: rely exclusively on knowledge
derived from dictionaries.
– Data-driven: based on probabilities collected from
sense-annotated data.
• Here a voting scheme is used which seeks agreement between the two approaches (sketched below)
• Wikify! provides highly precise annotation even at the cost of lower recall
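A minimal sketch of the voting scheme, where `knowledge_based` and `data_driven` are hypothetical callables standing in for the two disambiguation algorithms:

```python
def disambiguate(term, context, knowledge_based, data_driven):
    # Run both disambiguation methods and annotate only when they agree,
    # trading recall for precision as described above.
    sense_kb = knowledge_based(term, context)
    sense_dd = data_driven(term, context)
    if sense_kb == sense_dd:
        return sense_kb  # agreement: annotate with this sense
    return None          # disagreement: leave the term unannotated
```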
Wikify! - Disambiguation Evaluation
[Table] Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R), and F-measure (F) evaluations.
Wikify! - Overall Evaluation and Conclusion
• Wikify! allows the user to upload a text file or provide the URL of a webpage; it processes the document and returns the wikified version
• The user also has the option of setting the keyword density in the range 2%-10%, the default being 6%
• In a Turing-like test (20 users evaluating 10 documents each), human evaluators could tell the system's annotations apart from Wikipedia's manual ones in only 57% of cases, close to the ideal 50% at which the two would be indistinguishable
Learning to Link with Wikipedia
• A machine learning approach to identifying the significant terms within unstructured text
• It can provide structured knowledge about any unstructured text
• Uses Wikipedia articles as training data, which improves both recall and precision
Snapshot of Wikified document
Learning to Disambiguate Links
• Uses disambiguation to inform detection.
• Features such as the commonness and relatedness of a term are used to resolve ambiguity.
• The commonness of a sense is defined by how often Wikipedia articles use the term as a link to that sense's article.
• Commonness = (no. of links from the term to the sense's article) / (total no. of times the term is used as a link); see the sketch below.
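A minimal sketch of the commonness computation, assuming a hypothetical `link_destinations` table that counts, for each term, how often Wikipedia links it to each destination article:

```python
def commonness(term, sense, link_destinations):
    # link_destinations[term] maps destination articles to link counts,
    # e.g. link_destinations["tree"] = {"Tree": 920, "Tree (data structure)": 60}
    # (hypothetical numbers). Commonness is the sense's share of the term's links.
    destinations = link_destinations.get(term, {})
    total_links = sum(destinations.values())
    if total_links == 0:
        return 0.0
    return destinations.get(sense, 0) / total_links
```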
Disambiguation (Continued)
• Relatedness is given by the following formula:

relatedness(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|)))

where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia. A code transcription follows below.
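A direct transcription of the formula into code, with hypothetical in-link sets as input:

```python
import math

def relatedness(in_links_a, in_links_b, wikipedia_size):
    # in_links_a / in_links_b are the sets A and B of articles linking to
    # a and b; wikipedia_size is |W|. As printed, the measure behaves like
    # a normalized distance: lower values mean more closely related articles.
    size_a, size_b = len(in_links_a), len(in_links_b)
    overlap = len(in_links_a & in_links_b)
    if overlap == 0 or min(size_a, size_b) == 0:
        return float("inf")  # no shared in-links: maximally unrelated
    return (math.log(max(size_a, size_b)) - math.log(overlap)) / \
           (math.log(wikipedia_size) - math.log(min(size_a, size_b)))
```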
Disambiguation (Continued)
Commonness and Relatedness
Disambiguation (Continued)
• Not all context terms are equally useful, so each context term is assigned a weight: the average of its link probability and its relatedness to the other context terms.
• These are combined into a context-quality feature, defined as the sum of the weights previously assigned to the context terms (see the sketch below).
• These features are used to train the classifier.
• To configure the classifier, a parameter specifying the minimum probability of a sense is used.
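A minimal sketch of the context-quality feature, with `link_probability` and `avg_relatedness` as hypothetical lookup functions:

```python
def context_weights(context_terms, link_probability, avg_relatedness):
    # Each context term's weight is the average of its link probability
    # and its relatedness to the other context terms.
    return {term: (link_probability(term) + avg_relatedness(term)) / 2
            for term in context_terms}

def context_quality(context_terms, link_probability, avg_relatedness):
    # Context quality: the sum of the individual context-term weights.
    return sum(context_weights(context_terms,
                               link_probability, avg_relatedness).values())
```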
Disambiguation Evaluation
• The disambiguation classifier was trained over 500 articles (rather than the whole of Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM.
• The classifier was configured using a further 100 Wikipedia articles.
• Training took 13 minutes and testing 4 minutes; another 3 minutes were needed to load the required summaries of Wikipedia's link structure and anchor statistics into memory.
• To evaluate the classifier, 11,000 anchors were gathered from 100 random articles.
Disambiguation Evaluation (Continued)
Learning to Detect Links
• The central difference between Wikify!'s link detection approach and this new link detector: Wikify! relies exclusively on link probability, whereas the new approach also takes the context surrounding the terms into consideration.
• This link detector discards only terms with very low link probability, just enough to remove nonsense phrases and stop words (a minimal sketch follows below).
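A minimal sketch of this initial filtering step, with `link_probability` as a hypothetical lookup into precomputed Wikipedia statistics:

```python
def detection_candidates(ngrams, link_probability, threshold=0.065):
    # Keep only n-grams whose link probability clears a low threshold,
    # discarding stop words and nonsense phrases up front. The 6.5%
    # default mirrors the threshold mentioned later in these slides.
    return [g for g in ngrams if link_probability(g) >= threshold]
```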
Features used for Link Detection
• Link probability: the average link probability of the term's occurrences.
• Relatedness: the average semantic relatedness between each candidate topic and all the other candidates.
• Disambiguation confidence: the probability reported by the disambiguation classifier.
• Generality: how general the topic is, measured by its depth in Wikipedia's category tree.
• Location and spread: where the term first occurs and how widely its mentions are spread through the document.
These features are assembled into a vector for the classifier, as sketched below.
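A sketch of how these features might be assembled for the link-detection classifier; the `TopicStats` field names are illustrative stand-ins, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class TopicStats:
    # Hypothetical per-candidate statistics gathered from the document
    # and from Wikipedia's link structure.
    avg_link_probability: float       # average link probability of the term's occurrences
    avg_relatedness: float            # average relatedness to the other candidate topics
    disambiguation_confidence: float  # probability reported by the disambiguation classifier
    generality: float                 # depth of the topic in Wikipedia's category tree
    first_occurrence: float           # normalized position of the first mention
    spread: float                     # distance between the first and last mentions

def detection_features(stats):
    # Flatten the statistics into the feature vector fed to the classifier.
    return [stats.avg_link_probability, stats.avg_relatedness,
            stats.disambiguation_confidence, stats.generality,
            stats.first_occurrence, stats.spread]
```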
Link Detector
Link Detector Performance
• The same dataset as for the disambiguation classifier was used for training, configuration, and evaluation.
• The link probability threshold was set to 6.5%, the point at which recall and precision balance.
• The link detector was trained on unambiguous terms.
Link Detector Performance (Continued)
Wikification in the Wild
• The system was also tested on news articles instead of Wikipedia articles, achieving 76.4% accuracy in link detection.
Conclusions
• This system resolves synonymy as well as polysemy.
• A common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles.
• This paper has contributed a proven method for extracting key concepts from plain text.
• Finally, these are attempts to explain and organize the sum total of human knowledge.
Application on itself
Questions?