12. Annotation Survey.ppt

Semantic Web Bootstrapping & Annotation
Hassan Sayyadi
sayyadi@ce.sharif.edu
Semantic Web Research Laboratory
Computer Department
Sharif University of Technology
Outline
• What is annotation?
• Why use annotation?
• Crawler
• Annotation model
• Annotation methods
• Our Implementation
What is annotation?
• People make notes to themselves in order to
preserve ideas that arise during a variety of
activities
• The purpose of these notes is often to
summarize, criticize, or emphasize specific
phrases or events
• Semantic annotation tags instance data in a
document and maps each instance to its
ontology class.
Why use annotation?
• Having the world's knowledge at one's
fingertips seems possible.
• The Internet is the platform for
information.
• Unfortunately, most of the information is
provided in an unstructured and non-standardized form.
Crawler
• A crawler is a program that traverses
the Internet by following links from
one page to the next.
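The traversal described above can be sketched as a breadth-first walk over links. This is only an illustration: the class name CrawlSketch and the in-memory link map (standing in for real page downloads) are assumptions, not part of the presented system.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "web" is a map from a page to the links it contains.
final class CrawlSketch {
    // Returns pages in the order they are visited, starting from the seed.
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            visited.add(page);
            // Follow every link on the page that has not been queued before.
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

A real crawler would replace the map lookup with an HTTP fetch and link extraction; the frontier and the seen-set stay the same.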
Focused crawler
• Not all of the Internet's knowledge is required
for every query.
• This assumption seems reasonable because
most people work in a restricted domain and
do not need the knowledge of the whole
Internet.
• Searching the whole Internet in this case is
very inefficient and expensive.
• Free text on the Internet contains varied
information from diverse domains.
Focused crawler (continued)
• Focus can be achieved by
examining keywords
• Problems:
– "Understanding" the semantics of a document
– Focusing too narrowly on one topic
• Another way to focus is to use the Internet's
connectivity (link) structure
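A keyword-based focus test could look like the sketch below. The scoring rule (share of page words that are domain keywords), the threshold parameter, and the names KeywordFocus, relevance, and shouldFollow are assumptions chosen for illustration; the slides do not say how keywords are weighed.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical keyword-based relevance check for a focused crawler.
final class KeywordFocus {
    // Fraction of the words on the page that appear in the keyword list.
    static double relevance(String pageText, List<String> keywords) {
        String[] words = pageText.toLowerCase(Locale.ROOT).split("\\W+");
        int total = 0;
        int hits = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            total++;
            if (keywords.contains(word)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Follow links from a page only if it is relevant enough.
    static boolean shouldFollow(String pageText, List<String> keywords, double threshold) {
        return relevance(pageText, keywords) >= threshold;
    }
}
```

Pure keyword matching shows both problems listed above: it does not capture the meaning of the text, and a high threshold narrows the crawl to a single topic.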
Annotation models
• Mark up the web page inline
• Example:
– SUT is one of the largest engineering
schools in the Islamic Republic of Iran
– <university>SUT</university> is one of the
largest engineering schools in the <country>Islamic
Republic of Iran</country>
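Inline markup of this kind can be sketched with a simple dictionary lookup. The class InlineAnnotator and the entity-to-class map are hypothetical; the slides do not describe how entities are recognized, so a fixed dictionary is assumed here.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical dictionary-based inline annotator.
final class InlineAnnotator {
    // Wraps every whole-word occurrence of a known entity in a tag named after its class.
    static String annotate(String text, Map<String, String> entityToClass) {
        String result = text;
        for (Map.Entry<String, String> entry : entityToClass.entrySet()) {
            String entity = entry.getKey();
            String tag = entry.getValue();
            // Pattern.quote treats the entity as literal text, not as a regex.
            String pattern = "\\b" + Pattern.quote(entity) + "\\b";
            String replacement = Matcher.quoteReplacement(
                    "<" + tag + ">" + entity + "</" + tag + ">");
            result = result.replaceAll(pattern, replacement);
        }
        return result;
    }
}
```

One limitation of this sketch: an entity whose text equals a tag name inserted earlier would be matched inside that tag, so a production annotator would work on tokens rather than on the raw string.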
Annotation models (continued)
• Generate RDF
• Example:
– SUT is one of the largest engineering schools in the Islamic
Republic of Iran
– <rdf:Description rdf:about="http://sharif.edu/#SUT">
    <rdf:type>university</rdf:type>
    <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://sharif.edu/#Islamic+Republic+of+Iran">
    <rdf:type>Country</rdf:type>
  </rdf:Description>
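A fragment like the one above could be produced by plain string building, as in the following sketch. The class RdfWriter and its methods are hypothetical, the base URI is taken from the example, and replacing spaces with "+" is an assumption based on the URIs shown. The sketch performs no XML escaping.

```java
// Hypothetical serializer that mirrors the RDF fragment shown on the slide.
final class RdfWriter {
    static final String BASE = "http://sharif.edu/#";

    // Builds a resource URI; spaces become '+' as in the slide's example.
    static String uri(String name) {
        return BASE + name.trim().replace(' ', '+');
    }

    // Describes one subject with its type and, optionally, one relation to another resource.
    static String describe(String subject, String type, String property, String object) {
        StringBuilder sb = new StringBuilder();
        sb.append("<rdf:Description rdf:about=\"").append(uri(subject)).append("\">\n");
        sb.append("  <rdf:type>").append(type).append("</rdf:type>\n");
        if (property != null) {
            sb.append("  <SHARIF:").append(property)
              .append(" rdf:resource=\"").append(uri(object)).append("\"/>\n");
        }
        sb.append("</rdf:Description>");
        return sb.toString();
    }
}
```

Calling describe("SUT", "university", "be_in", "Islamic Republic of Iran") yields the first description of the example; passing null for the property yields a type-only description like the second one.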
Annotation methods
• Manually
• Semi-automatically
• Automatically
Automatic Annotation
• The fully automatic creation of semantic
annotations is an unsolved problem.
• Automatic semantic annotation of the
natural language sentences in web
pages is a daunting task, so we are
often forced to annotate manually or
semi-automatically using handwritten rules
Manual Annotation
• Manual annotation is more easily accomplished
today, using authoring tools, which provide an
integrated environment for simultaneously authoring
and annotating text.
• However, the use of human annotators is often
fraught with errors due to factors such as annotator
familiarity with the domain, amount of training,
personal motivation and complex schemas
• Manual annotation is also an expensive process
Semi-automatic Annotation
• To overcome the annotation acquisition
bottleneck, semi-automatic annotation of
documents has been proposed.
Semi-automatic annotation
• Assumptions:
– The vocabulary is limited
– Word usage follows patterns
– Semantic ambiguities are rare
– Terms and jargon of the domain appear
frequently
Semantic Annotation Platform (SAP)
Multistrategy SAPs
• Multistrategy SAPs are able to combine
methods from both pattern-based and
machine learning-based systems.
• No SAP currently implements the
multistrategy approach for semantic
annotation, although it has been
implemented in systems for ontology
extraction (such as On-To-Knowledge)
Semi-automatic annotation
(continued)
• Example
– I go to Shanghai
• The link structure is
more like an RDF
graph
Accuracy of extracted concepts
and relations for different
algorithms
Automatic annotation
Source preprocessing
• Document Object Model (DOM)
• Text Model
• Layout Model
• NLP Model
Information Identification
• Operators
– perform extraction actions on document access
models
– Retrieval, Check, Execute
• Strategies
– build operator sequences according to user time
and quality requirements
• Source Description
– build operator sequences according to user time
and quality requirements
Ontology population
• The final stage of the overall process is to
decide which hypothesis represents the
extracted information to insert into the
ontology
• The module simulates insertions and
calculates the cost according to the number
of new instance creations, instance
modifications or inconsistencies found
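The cost-based choice between hypotheses can be sketched as follows. The slides name the three cost factors but give no weights, so the weights below, the class HypothesisSelector, and the linear cost formula are all assumptions for illustration only.

```java
import java.util.List;

// Hypothetical cost model: the weights are illustrative, not from the described system.
final class HypothesisSelector {
    static final class Hypothesis {
        final String label;
        final int creations;        // new instances the insertion would create
        final int modifications;    // existing instances it would modify
        final int inconsistencies;  // inconsistencies found during simulation

        Hypothesis(String label, int creations, int modifications, int inconsistencies) {
            this.label = label;
            this.creations = creations;
            this.modifications = modifications;
            this.inconsistencies = inconsistencies;
        }
    }

    static final int CREATE_WEIGHT = 1;
    static final int MODIFY_WEIGHT = 2;
    static final int INCONSISTENCY_WEIGHT = 10;

    // Cost of a simulated insertion as a weighted sum of the three factors.
    static int cost(Hypothesis h) {
        return CREATE_WEIGHT * h.creations
                + MODIFY_WEIGHT * h.modifications
                + INCONSISTENCY_WEIGHT * h.inconsistencies;
    }

    // Returns the hypothesis with the lowest cost, or null for an empty list.
    static Hypothesis cheapest(List<Hypothesis> candidates) {
        Hypothesis best = null;
        for (Hypothesis h : candidates) {
            if (best == null || cost(h) < cost(best)) {
                best = h;
            }
        }
        return best;
    }
}
```

Weighting inconsistencies most heavily reflects the idea that a hypothesis contradicting the ontology is less likely to be the correct reading, but any ordering of the weights is a design choice.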
Our implementation
• Crawler:
– Crawl all links that contain:
• sharif.ir
• sharif.edu
• sharif.ac.ir
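The scope restriction above might be implemented as a host check, sketched below. The slide says only that the link must contain one of the three domains; comparing against the host name rather than the whole URL string is an assumption made here to avoid matching a domain that merely appears in a path or query. The class DomainFilter is hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical scope filter for the crawler's frontier.
final class DomainFilter {
    static final List<String> DOMAINS =
            Arrays.asList("sharif.ir", "sharif.edu", "sharif.ac.ir");

    // True if the link's host is one of the domains or a subdomain of one.
    static boolean inScope(String link) {
        String host;
        try {
            host = URI.create(link).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        if (host == null) {
            return false; // relative link or URI without a host
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : DOMAINS) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```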
Our implementation
• Source pre-processing
– HTML to text
• text = text.replaceAll("\n", "*_newline_*");
• text = text.replaceAll("<script.*?</script>", "");
• text = text.replaceAll("<style.*?</style>", "");
• text = text.replaceAll("<!--.*?-->", "");
• text = text.replaceAll("<.*?>", "");
• text = text.replaceAll("&nbsp;", " ");
• text = text.replaceAll("&lt;", "<");
• …
• text = text.replaceAll("\\*_newline_\\*", "\n");
– Additional
• text = text.replaceAll("\n[\n ]*\n", ".");
• text = text.replaceAll(",", " and ");
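The HTML-to-text steps can be collected into one method, as in this sketch. It swaps the newline placeholder for the regex flags (?s) and (?i), which let "." match newlines and make tag names case-insensitive. It also assumes the two entity replacements on the slide stood for &nbsp; and &lt;. The class name HtmlToText is hypothetical, and the steps elided on the slide are not reproduced.

```java
// Hypothetical consolidation of the slide's HTML-to-text replacements.
final class HtmlToText {
    static String strip(String html) {
        String text = html;
        // Remove scripts and styles together with their content.
        text = text.replaceAll("(?is)<script.*?</script>", "");
        text = text.replaceAll("(?is)<style.*?</style>", "");
        // Remove comments, then every remaining tag.
        text = text.replaceAll("(?s)<!--.*?-->", "");
        text = text.replaceAll("(?s)<.*?>", "");
        // Decode entities only after the tags are gone, so a decoded '<' is not mistaken for a tag.
        text = text.replace("&nbsp;", " ");
        text = text.replace("&lt;", "<");
        return text;
    }
}
```

Regex-based stripping is fragile on malformed HTML; an HTML parser would be more robust, but the regex form stays close to the approach on the slide.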
Our implementation
• Information extraction:
– JMontyLingua
• SUT is one of the largest engineering schools
in the Islamic Republic of Iran
• ("be" "SUT" "one" "of largest engineering
school" "in Islamic Republic" "of Iran")
Our implementation
• JMontyLingua problem:
– SUT has computer, mechanic and electric
engineering departments
– ("have" "SUT" "computer mechanic and
electric engineering departments")
– ("have" "SUT" "computer and mechanic
and electric engineering departments")
Our implementation
• ("be" "SUT" "university" "in Islamic Republic" "of Iran")
• => ("be" "SUT" "university" "in Islamic Republic of Iran")
• => SUT,be,university & SUT,be_in,Islamic Republic of Iran
• <rdf:Description rdf:about="http://sharif.edu/#SUT">
    <rdf:type>university</rdf:type>
    <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/>
  </rdf:Description>
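The two rewriting steps above (merging the trailing "of" phrase, then splitting the tuple into triples) can be sketched as follows. The tuple layout (verb, subject, object, prepositional phrases), the rule that only arguments from the fourth position onward are merged, the preposition list, and the class TripleBuilder are assumptions inferred from the single example shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing of a (verb, subject, object, phrases...) tuple.
final class TripleBuilder {
    static final List<String> PREPOSITIONS = Arrays.asList("in", "at", "on", "from", "to");

    // Appends each later argument that starts with "of " to the argument before it.
    static List<String> mergeOfPhrases(List<String> tuple) {
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < tuple.size(); i++) {
            String part = tuple.get(i);
            if (i >= 3 && part.startsWith("of ")) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + " " + part);
            } else {
                merged.add(part);
            }
        }
        return merged;
    }

    // Emits "subject,predicate,object" strings; a leading preposition joins the verb.
    static List<String> toTriples(List<String> tuple) {
        String verb = tuple.get(0);
        String subject = tuple.get(1);
        List<String> triples = new ArrayList<>();
        for (int i = 2; i < tuple.size(); i++) {
            String arg = tuple.get(i);
            int space = arg.indexOf(' ');
            String first = space < 0 ? arg : arg.substring(0, space);
            if (space > 0 && PREPOSITIONS.contains(first)) {
                triples.add(subject + "," + verb + "_" + first + "," + arg.substring(space + 1));
            } else {
                triples.add(subject + "," + verb + "," + arg);
            }
        }
        return triples;
    }
}
```

Each resulting triple maps directly onto the RDF description: the first becomes the rdf:type element and the second becomes the SHARIF:be_in property.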
Any questions?