Wikification
CSE 6339 (Section 002)
Abhijit Tendulkar

• Wikify! Linking Documents to Encyclopedic Knowledge. R. Mihalcea and A. Csomai
• Learning to Link with Wikipedia. D. Milne and I. H. Witten

What is Wikification
• Automatic keyword extraction
• Word sense disambiguation
• Automatically cross-referencing documents (unstructured text) with Wikipedia

Wikify! - Introduction
• Introduces automatic annotation of documents by linking them to Wikipedia.
• Potential applications include the semantic web, educational applications, and a number of other text processing problems.
• Previous similar systems (Microsoft Smart Tags, Google AutoLink) were based merely on word or phrase lookup, with no keyword extraction or disambiguation.

Wikify! - Text Wikification
[figure]

Wikify! - Keyword Extraction
• Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select the proper amount of keywords.
• Unsupervised algorithms involve two steps:
  – Candidate extraction: extract all possible n-grams.
  – Keyword ranking: assign a numeric value to each candidate. Three ranking methods were used: tf-idf, the χ² independence test, and keyphraseness.

Wikify! - Evaluation of Keyword Extraction
[figure]

Wikify! - Word Sense Disambiguation
• Ambiguity is inherent to human language.
• Disambiguation algorithms:
  – Knowledge-based: rely exclusively on knowledge derived from dictionaries.
  – Data-driven: based on probabilities collected from sense-annotated data.
• Wikify! uses a voting scheme that seeks agreement between the two approaches.
• This yields highly precise annotations, even if recall is lower.

Wikify! - Disambiguation Evaluation
[Figure] Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R) and F-measure (F) evaluations.

Wikify! - Overall Evaluation and Conclusion
• Wikify! lets the user upload a text file or provide the URL of a web page, processes the document, and returns its wikified version.
• The user can also set the density of keywords, in the range 2%-10% (default 6%).
• In a Turing-like evaluation with human judges (20 users evaluating 10 documents each), the automatic annotations were correctly distinguished from manual Wikipedia annotations in only 57% of cases (50%, i.e. chance, would be the ideal result).

Learning to Link with Wikipedia
• A machine learning approach to identifying significant terms within unstructured text.
• It can provide structured knowledge about any unstructured text.
• Uses Wikipedia articles as training data, which improves recall and precision.

Snapshot of Wikified document
[figure]

Learning to Disambiguate Links
• Uses disambiguation to inform detection.
• Features such as the commonness and relatedness of a term are used to resolve ambiguity.
• The commonness of a sense is defined by the number of times it is used as a link destination in Wikipedia:
  Commonness = (no. of times the term links to that particular sense) / (total no. of times the term is used as a link)

Disambiguation (Continued)
• Relatedness is given by the following formula:
  relatedness(a, b) = ( log(max(|A|, |B|)) - log(|A ∩ B|) ) / ( log(|W|) - log(min(|A|, |B|)) )
  where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.
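To make the two measures above concrete, here is a minimal Python sketch of commonness and link-based relatedness. The names and data structures (anchor_counts, in_links_a, the example counts) are illustrative assumptions rather than the papers' implementation, and the raw relatedness value is distance-like: 0 means the two articles share all of their in-links.

import math

def commonness(anchor_counts, term, sense):
    """Fraction of the term's links in Wikipedia that point to this sense.
    anchor_counts is an illustrative dict of dicts mined from a dump,
    e.g. {"jaguar": {"Jaguar Cars": 460, "Jaguar": 220}} (made-up numbers)."""
    senses = anchor_counts.get(term, {})
    total = sum(senses.values())
    return senses.get(sense, 0) / total if total else 0.0

def relatedness(in_links_a, in_links_b, total_articles):
    """Link-based relatedness between articles a and b.
    in_links_a / in_links_b: sets of article ids that link to a and b;
    total_articles: |W|, the number of articles in Wikipedia.
    The raw value is distance-like (0 = a and b share all their in-links);
    a similarity in [0, 1] is often taken as max(0, 1 - value), which is
    a common convention, not something stated on the slide."""
    a, b = len(in_links_a), len(in_links_b)
    overlap = len(in_links_a & in_links_b)
    if overlap == 0 or a == 0 or b == 0:
        return float("inf")  # no shared in-links: no evidence of relatedness
    return (math.log(max(a, b)) - math.log(overlap)) / \
           (math.log(total_articles) - math.log(min(a, b)))

# Made-up example: "jaguar" linking to "Jaguar Cars" about 68% of the time,
# and two articles that share 30 of their 100 in-linking articles.
anchor_counts = {"jaguar": {"Jaguar Cars": 460, "Jaguar": 220}}
print(commonness(anchor_counts, "jaguar", "Jaguar Cars"))
print(relatedness(set(range(100)), set(range(70, 170)), 3_000_000))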
Disambiguation (Continued)
[Figure: commonness and relatedness example]

Disambiguation (Continued)
• Not all context terms are equally useful, so each context term is assigned a weight: the average of its link probability and its relatedness (a small sketch of this weighting appears at the end of this deck).
• These are combined into a context quality feature, defined as the sum of the weights previously assigned to the context terms.
• These features are used to train the classifier.
• The classifier is configured with a parameter specifying the minimum probability of a sense.

Disambiguation Evaluation
• The disambiguation classifier was trained over 500 articles (instead of the entire Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM.
• The classifier was configured using 100 Wikipedia articles.
• Training took 13 minutes and testing took 4 minutes; another 3 minutes were required to load the necessary summaries of Wikipedia's link structure and anchor statistics into memory.
• To evaluate the classifier, 11,000 anchors were gathered from 100 random articles.

Disambiguation Evaluation
[figures]

Learning to Detect Links
• The central difference between Wikify!'s link detection approach and this new link detector: Wikify! relies exclusively on link probability, whereas the new approach also takes the context surrounding the terms into consideration.
• The link detector discards only terms with a very low link probability, so that nonsense phrases and stop words are removed.

Features used for Link Detection
• Link probability: the average link probability is considered.
• Relatedness: semantic relatedness, the average relatedness between each topic and all other candidates.
• Disambiguation confidence
• Generality
• Location and spread

Link Detector
[figure]

Link Detector Performance
• The same dataset as for the disambiguation classifier was used for training, configuration, and evaluation.
• The link probability threshold was set to 6.5%, the point at which recall and precision balance.
• The link detector was trained on unambiguous terms.

Link Detector Performance (Continued)
[figures]

Wikification in the Wild
• The system was also tested on news articles rather than Wikipedia articles, and achieved 76.4% accuracy in link detection.
[figure]

Conclusions
• The system resolves ambiguity as well as polysemy.
• A common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles.
• This paper has contributed a proven method for extracting key concepts from plain text.
• Finally, these are attempts to explain and organize the sum total of human knowledge.

Application on itself
[figure]

Questions?
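As referenced on the "Disambiguation (Continued)" slide, the following is a minimal Python sketch of the context-term weighting and the resulting features (commonness, relatedness, context quality). The function names, the assumption that the scores lie in [0, 1], and the exact way the features are combined are illustrative, not taken from the papers.

def context_weight(link_probability, avg_relatedness):
    """Weight of a context term: the average of its link probability and its
    relatedness (both assumed to be similarity scores in [0, 1])."""
    return (link_probability + avg_relatedness) / 2.0

def disambiguation_features(sense_commonness, relatedness_to_context, context_weights):
    """Feature values for one candidate sense: commonness, relatedness to the
    weighted context, and context quality (the sum of the context-term
    weights). A sketch only; names and exact combination are assumptions."""
    context_quality = sum(context_weights)
    weighted_relatedness = (
        sum(w * r for w, r in zip(context_weights, relatedness_to_context)) / context_quality
        if context_quality else 0.0
    )
    return {
        "commonness": sense_commonness,
        "relatedness": weighted_relatedness,
        "context_quality": context_quality,
    }

# Made-up example: a sense used 80% of the time, with two context terms.
weights = [context_weight(0.40, 0.70), context_weight(0.20, 0.50)]
print(disambiguation_features(0.80, [0.60, 0.30], weights))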