Semantic Web Bootstrapping & Annotation
Hassan Sayyadi
sayyadi@ce.sharif.edu
Semantic Web Research Laboratory
Computer Department
Sharif University of Technology

Outline
• What is annotation?
• Why use annotation?
• Crawler
• Annotation model
• Annotation methods
• Our implementation

What is annotation?
• People make notes to themselves in order to preserve ideas that arise during a variety of activities.
• The purpose of these notes is often to summarize, criticize, or emphasize specific phrases or events.
• Semantic annotation tags instances of ontology classes in the data and maps them to those classes.

Why use annotation?
• Having the world's knowledge at one's fingertips seems possible.
• The Internet is the platform for information.
• Unfortunately, most of the information is provided in an unstructured and non-standardized form.

Why use annotation? (continued)

Crawler
• A crawler is a program that traverses the Internet by following links from one page to the next.

Focused crawler
• Not all of the Internet's knowledge is required for every query.
• This assumption seems reasonable because most people work in a restricted domain and do not need the knowledge of the whole Internet.
• Searching the whole Internet in this case is very inefficient and expensive.
• Free text on the Internet contains varied information from diverse domains.
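
A minimal sketch of the crawling idea, assuming an in-memory link graph instead of real HTTP fetching so that it runs offline. The class name FocusedCrawlSketch, the crawl method, the inFocus predicate and the example URLs are all hypothetical and are not taken from the slides; the sketch only shows a breadth-first traversal that follows links and skips pages outside the chosen focus.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class FocusedCrawlSketch {

    // Breadth-first traversal of a link graph.
    // Only URLs accepted by inFocus are visited or queued (hypothetical focus test).
    static List<String> crawl(String seed,
                              Map<String, List<String>> links,
                              Predicate<String> inFocus) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        if (inFocus.test(seed)) {
            frontier.add(seed);
            seen.add(seed);
        }
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            visited.add(url);
            // Follow every outgoing link that is in focus and not yet seen.
            for (String next : links.getOrDefault(url, List.of())) {
                if (inFocus.test(next) && seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up link graph standing in for fetched pages.
        Map<String, List<String>> links = Map.of(
            "a.example/home", List.of("a.example/research", "b.example/news"),
            "a.example/research", List.of("a.example/home", "a.example/people"),
            "b.example/news", List.of("b.example/sports"));
        // Focus on one site: only URLs under a.example/ are followed.
        System.out.println(crawl("a.example/home", links, url -> url.startsWith("a.example/")));
    }
}
```

A real crawler would replace the map lookup with page download and link extraction; the focus predicate is where a keyword test or a link-structure test would plug in.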

Focused crawler (continued)
• The focus can be achieved by examining keywords.
• Problems:
  – "Understanding" the semantics of a document
  – Focusing too narrowly on one topic
• Another way to focus is to use the connectivity (link) structure of the Internet.

Annotation models
• Mark in web page
• Example:
  – SUT is one of the largest engineering schools in the Islamic Republic of Iran
  – <university>SUT</university> is one of the largest engineering schools in the <country>Islamic Republic of Iran</country>

Annotation models (continued)
• Generate RDF
• Example:
  – SUT is one of the largest engineering schools in the Islamic Republic of Iran
  – <rdf:Description rdf:about="http://sharif.edu/#SUT">
      <rdf:type>university</rdf:type>
      <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://sharif.edu/#Islamic+Republic+of+Iran">
      <rdf:type>Country</rdf:type>
    </rdf:Description>

Annotation methods
• Manually
• Semi-automatically
• Automatically

Automatic Annotation
• The fully automatic creation of semantic annotations is an unsolved problem.
• Automatic semantic annotation of the natural language sentences in these pages is a daunting task, and we are often forced to do it manually or semi-automatically using handwritten rules.

Manual Annotation
• Manual annotation is more easily accomplished today using authoring tools, which provide an integrated environment for simultaneously authoring and annotating text.
• However, the use of human annotators is often fraught with errors due to factors such as annotator familiarity with the domain, amount of training, personal motivation and complex schemas.
• Manual annotation is also an expensive process.

Semi-automatic Annotation
• To overcome the annotation acquisition bottleneck, semi-automatic annotation of documents has been proposed.

Semi-automatic annotation
• Assumptions:
  – The vocabulary set is limited
  – Word usage has patterns
  – Semantic ambiguities are rare
  – Terms and jargon of the domain appear frequently

Semantic Annotation Platform (SAP)

Multistrategy SAPs
• Multistrategy SAPs are able to combine methods from both pattern-based and machine learning-based systems.
• No SAP currently implements the multistrategy approach for semantic annotation, although it has been implemented in systems for ontology extraction (such as On-To-Knowledge).

Semi-automatic annotation (continued)
• Example
  – I go to Shanghai
• Link structure is more like an RDF graph

Accuracy of concepts and relations for different algorithms

Automatic annotation

Source preprocessing
• Document Object Model (DOM)
• Text Model
• Layout Model
• NLP Model

Information Identification
• Operators
  – perform extraction actions on document access models
  – Retrieval, Check, Execute
• Strategies
  – build operator sequences according to user time and quality requirements
• Source Description
  – build operator sequences according to user time and quality requirements

Ontology population
• The final stage of the overall process is to decide which hypothesis represents the extracted information to insert into the ontology.
• The module simulates insertions and calculates the cost according to the number of new instance creations, instance modifications or inconsistencies found.

Our implementation
• Crawler:
  – Crawl all links that contain:
    • sharif.ir
    • sharif.edu
    • sharif.ac.ir

Our implementation
• Source pre-processing
  – HTML to text
    • text = text.replaceAll("\n", "*_newline_*");
    • text = text.replaceAll("\\<script.*?\\</script\\>", "");
    • text = text.replaceAll("\\<style.*?\\</style\\>", "");
    • text = text.replaceAll("\\<\\!--.*?--\\>", "");
    • text = text.replaceAll("\\<.*?\\>", "");
    • text = text.replaceAll("&nbsp;", " ");
    • text = text.replaceAll("&lt;", "<");
    • …
    • text = text.replaceAll("\\*_newline_\\*", "\n");
  – Additional
    • text = text.replaceAll("\n(\n| )*\n", ".");
    • text = text.replaceAll(",", " and ");

Our implementation
• Information extraction:
  – JMontyLingua
    • SUT is one of the largest engineering schools in the Islamic Republic of Iran
    • ("be" "SUT" "one" "of largest engineering school" "in Islamic Republic" "of Iran")

Our implementation
• JMontyLingua problem:
  – SUT has computer, mechanic and electric engineering departments
  – Without comma replacement: ("have" "SUT" "computer mechanic and electric engineering departments")
  – With "," replaced by " and ": ("have" "SUT" "computer and mechanic and electric engineering departments")

Our implementation
• ("be" "SUT" "university" "in Islamic Republic" "of Iran")
• => ("be" "SUT" "university" "in Islamic Republic of Iran")
• => SUT,be,university & SUT,be_in,Islamic Republic of Iran
• <rdf:Description rdf:about="http://sharif.edu/#SUT">
    <rdf:type>university</rdf:type>
    <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/>
  </rdf:Description>

Any questions?
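
The last conversion step can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the class TupleToRdfSketch and the methods toTriples, toRdf and resourceName are hypothetical names, and the tuple layout (verb, subject, object, then prepositional phrases) is inferred from the example tuples on the slides. The sketch starts from the already merged tuple, treats the verb "be" as rdf:type exactly as the slide's output does, assumes a tuple with at least three elements, and performs no XML escaping.

```java
import java.util.ArrayList;
import java.util.List;

public class TupleToRdfSketch {

    // Turn a tuple (verb, subject, object, prepositional phrases...) into
    // {subject, property, value} triples, e.g. SUT,be,university and SUT,be_in,...
    static List<String[]> toTriples(List<String> tuple) {
        String verb = tuple.get(0);
        String subject = tuple.get(1);
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[] {subject, verb, tuple.get(2)});
        for (String phrase : tuple.subList(3, tuple.size())) {
            int space = phrase.indexOf(' ');
            if (space < 0) {
                continue; // no leading preposition found; ignored in this sketch
            }
            String preposition = phrase.substring(0, space);
            String value = phrase.substring(space + 1);
            triples.add(new String[] {subject, verb + "_" + preposition, value});
        }
        return triples;
    }

    // Spaces become '+' to mirror the resource names used on the slides.
    static String resourceName(String name) {
        return name.trim().replace(' ', '+');
    }

    // Serialize triples that share one subject as a single rdf:Description element.
    static String toRdf(String baseUri, List<String[]> triples) {
        String subject = triples.get(0)[0];
        StringBuilder xml = new StringBuilder();
        xml.append("<rdf:Description rdf:about=\"")
           .append(baseUri).append(resourceName(subject)).append("\">\n");
        for (String[] triple : triples) {
            if (triple[1].equals("be")) {
                // The slides write the class name as the text content of rdf:type.
                xml.append("  <rdf:type>").append(triple[2]).append("</rdf:type>\n");
            } else {
                xml.append("  <SHARIF:").append(triple[1])
                   .append(" rdf:resource=\"")
                   .append(baseUri).append(resourceName(triple[2])).append("\"/>\n");
            }
        }
        xml.append("</rdf:Description>");
        return xml.toString();
    }

    public static void main(String[] args) {
        List<String> tuple = List.of("be", "SUT", "university", "in Islamic Republic of Iran");
        System.out.println(toRdf("http://sharif.edu/#", toTriples(tuple)));
    }
}
```

Run on the merged tuple from the slide, the sketch prints an rdf:Description for SUT with an rdf:type of university and a SHARIF:be_in property pointing at Islamic+Republic+of+Iran.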