Semantic Web Architecture

COMPSCI 732:
Semantic Web Technologies
Semantic Web Architecture
Slides are based on Lecture Notes by Dieter Fensel and Federico Facca
1
Where are we?
#  Title
1  Introduction
2  Semantic Web Architecture
3  Resource Description Framework (RDF)
4  Web of Data
5  Generating Semantic Annotations
6  Storage and Querying
7  Web Ontology Language (OWL)
8  Rule Interchange Format (RIF)
2
Overview
• Introduction and motivation
• Technical solutions
– Semantic Web architecture
– Uniform Resource Identifier
– eXtensible Markup Language (XML)
– XML Schema
– Namespaces
• Extensions
• Illustration by a large example
• Summary
• References
3
INTRODUCTION AND
MOTIVATION
4
A Semantic Web Scenario From Today
•
Queries:
– Which type of music is played by UK radio stations?
– Which UK radio station is playing titles by Swedish composers?
•
Information to answer query is available on the Web
•
Web search engines analyze Web content one page at a time
•
The Semantic Web provides a better framework to answer such queries
– combines data
– distributed across different sources, and
– described in machine-interpretable manner
5
Steps in Answering Queries
• Playlists of BBC radio shows published online in Semantic Web formats
• Music groups such as “ABBA” have an identifier
http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist
• Identifier can relate music group to information at MusicBrainz
– Music community portal exposing data on Semantic Web
– http://musicbrainz.org
– Knows about band members (e.g. Benny Andersson)
– Aligns its information with Wikipedia
• Information on UK radio stations may be found in lists on Web pages
– Can be translated into similar Semantic Web representation
6
Describing Things and Their Relationships
•
Meaning of Relationships, e.g., band memberships explained online, too
•
Using collections of Ontologies available on the Web
– Dublin Core (general properties of information resources)
http://dublincore.org/
– SKOS (covering taxonomic descriptions)
http://www.w3.org/2004/02/skos/
– Specialized ontologies (covering the music domain)
•
Data at the BBC currently use at least nine different ontologies
– http://www.bbc.co.uk/ontologies/programmes
•
Availability of data in these formats enables queries to be answered
– Based on a query language
7
Towards the Required Infrastructure
•
What infrastructure is required to implement the scenario from before?
•
Generic software components, Languages, Protocols
•
Their seamless interaction to satisfy requests
• Purpose of Lecture:
– Investigate Semantic Web Architecture
– Analyze requirements from technical need to identify and relate data
– Analyze organizational needs to maintain Semantic Web as a whole
8
Web Architecture
•
The Semantic Web is an evolution of the Web
•
Important for the fast growth and adoption of the Web are
– Many people can set up Web servers easily and independently from each other
– More people can create documents, put them online, and link them to each other
– Even more people can browse and access any Web server to retrieve documents
•
Web architecture allows graceful degradation of user experience when:
– Network is partially slow (World Wide Wait), while other parts still operate at full speed
– Single Web servers break, because others still work
– Hyperlinks are broken, because other links still lead somewhere
•
Separation of concerns makes lower-quality output tolerable
– Users can easily create and access documents
– Distributed nature of system, without need of central coordinator, results in robustness
9
Web Architecture Principles
1. Explicit simple data representation
– Common data representation hides underlying technologies (e.g. HTML)
2. Distributed system
– Data sources without centralized instance controlling who owns what type of info
– Distributed ownership and control can facilitate adoption and scalability
– E.g. Web pages are under full control of their producers
3. Cross-referencing
– Reuse of existing data and data definitions from different authorities (e.g. hyperlinks)
4. Loose coupling achieved by common language layers
– Communication in standardized languages
– These must be easy to customize
– Overall communication must not be jeopardized by such specialization
– E.g. Coupling of Web clients/servers: HTTP for transport, HTML for Web content
5. Ease of publishing and consumption
– Easy publishing and consumption of simple data
– Comprehensive publishing and consumption of complex data, e.g.:
HTML simple to convey textual info; powerful browsers/content management systems
10
Semantic Web Requirements and Examples
•
Must be able to represent entities and their relationships (1)
–
•
Must be serializable in standardized manner to easily exchange data between
different computing nodes (1,2,4)
–
•
The number of Swedish composers being broadcast on a specific program
Reasoning desirable to facilitate querying (5)
–
•
•
Manual inspection not scalable; refinements of basic model impossible
BBC Data involves radio stations, shows, their versions, songs and their artists
A query and manipulation language to select and aggregate data (5)
–
•
ABBA’s Benny Andersson becomes hard to distinguish from other Benny Anderssons
Expressive, machine-understandable data description language (1,4,5)
–
–
•
Ease of joining information from MusicBrainz, BBC, DBPedia
Entities must be referable across borders of ownership or computing systems
to allow for cross-linking of data (1,2,3,4)
–
•
A person, the birthday of a person, the name of a person (“Benny Andersson”)
Direct relationship between a program and a song using inference
Transport of data and query and their results by agreed-upon protocols (HTTP)
May involve encrypted data requests and transports (HTTPs); signature of data
items to ensure authenticity of user requests and control access to resources
11
Additional Requirements
• Core requirements not yet included in language architecture
• Versatile means for user interaction
– Broad accessibility requires viewing, searching, browsing, querying of data
– While at the same time abstracting from intricacies underlying their distributed origin
– On-the-fly data integration of multiple data sources: assemble information from a multitude of sources without a priori knowledge about domain or structure of data
– Facilitation of data production and publishing: metadata creation and migration of data must be made convenient, independent from origin of data
• Provenance and Trust
– Authorship and ownership get lost during data processing and aggregation
– Origin, Reliability, Trustworthiness must be rethought to apply them for individual and aggregated data items, to establish faithful authentication at Semantic Web scale
• Alignment of unconnected sets of data
– Interlinking implies capability to suggest alignments between identifiers or concepts from different sets of data, beyond mere use of identifiers such as URI/IRIs
– Such alignment may be necessary to enable a real Web of Data
12
Semantic Web Architecture
•
Formalized components and their relationships
– What technologies make up the Semantic Web
– What are the dependencies between components
•
Roadmap for steps of developing the Semantic Web
13
The Semantic Web architecture and its foundations
TECHNICAL SOLUTION
14
Search and Query the Web I
• The Web is a constantly growing network of distributed resources
– More than 1 trillion unique URLs
– More than 100 billion pages
– More than 200 million web sites
– Check most updated data on:
http://news.netcraft.com/archives/web_server_survey.html
• Users need to be able to search resources/content on the Web efficiently
– When I Google “Milan” do I find info on the city or the soccer team?
• Users need to be able to query widely distributed resources
– When is the next performance of the rock band “U2”, where will it take place, what are the best ways to reach the location, what are the attractions nearby…
15
Search and Query the Web II
•
On2Broker is the evolution of Ontobroker, a system that aims to
solve the problems discussed in the previous slides by
adopting Semantic Technologies
•
On2Broker is a system that processes distributed information sources
and that provides intelligent information retrieval, query answering
•
On2Broker relies on components of the Semantic Web Architecture
[D. Fensel, S. Decker, M. Erdmann, R. Studer: Ontobroker in a Nutshell. ECDL 1998: 663-664]
16
On2Broker: Architecture
17
On2Broker Components I
•
Query Interface
– Provides a structured input that enables users to define their queries without any
knowledge of the query language
– Input queries are then transformed to the query language (e.g. SPARQL)
•
Repository
– Decouples query answering, information retrieval and reasoning
– Provides support for materialization of inferred knowledge
18
On2Broker Components II
•
Crawlers and Wrappers (or Info Agent)
– Extract knowledge from different distributed and heterogeneous data sources
– RDFa pages and RDF repositories can be included directly
– HTML and XML data sources require processing by wrappers to derive RDF data
•
Inference Engine
– Relies on knowledge imported from the crawlers and axioms contained in the
repository to support query answers
– Adopts Horn logic and closed world assumption
19
On2Broker: Example
1. Whom does Tim Berners-Lee know?
2. SELECT DISTINCT ?s ?o
WHERE { ?s foaf:knows ?o . } …
3. Extract RDF from: http://www.w3.org/People/Berners-Lee/, dblp, …
Extract RDF from: fensel.com, dblp, …
4. Extends KB:
if “x dblp:coauthor y“ then “x foaf:knows y”
if “y foaf:knows x“ then “x foaf:knows y”
5. Tim Berners-Lee knows Christian Bizer and Tom Heath
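The knowledge-base extension in step 4 can be sketched as a small forward-chaining loop. This is only an illustration in Python, not On2Broker's actual inference engine; the short resource names (`timbl`, `bizer`, `heath`), the input facts and the function `extend_kb` are assumptions made up for the example.

```python
def extend_kb(triples):
    """Apply the two rules from the slide until no new triple is derived."""
    kb = set(triples)
    while True:
        derived = set()
        for s, p, o in kb:
            # Rule 1: if "x dblp:coauthor y" then "x foaf:knows y"
            if p == "dblp:coauthor":
                derived.add((s, "foaf:knows", o))
            # Rule 2: if "y foaf:knows x" then "x foaf:knows y"
            if p == "foaf:knows":
                derived.add((o, "foaf:knows", s))
        if derived <= kb:      # fixpoint reached, nothing new
            return kb
        kb |= derived


# Hypothetical facts, standing in for what the crawlers extracted.
facts = {
    ("timbl", "dblp:coauthor", "bizer"),
    ("heath", "foaf:knows", "timbl"),
}
kb = extend_kb(facts)

# Corresponds to the query "?s foaf:knows ?o" with ?s bound to timbl.
known = sorted(o for s, p, o in kb if s == "timbl" and p == "foaf:knows")
print(known)
```

Materializing the derived triples in the repository, as done here, is one design choice; an engine could also evaluate the rules at query time.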
20
SemWeb Architecture: Requirements
•
Extensibility
– Each layer should extend the previous one(s)
•
Support for data interchange
– Using data from one source in other applications
•
Support for ontology description with different complexity
– Including rules
•
• Support for data query
• Support for data provenance and trust evaluation
see the Semantic Web Roadmap: http://www.w3.org/DesignIssues/Semantic.html
21
Semantic Web Stack
[Figure: Semantic Web Stack, with RIF as the rules layer. Adapted from http://en.wikipedia.org/wiki/Semantic_Web_Stack]
22
UNICODE, URI and XML
• UNICODE is the standard international character set
– E.g. used to encode the data in the repository
• Uniform Resource Identifiers (URIs) identify things and concepts
– E.g. used to identify resources on the Web and in the repository
– Take care to distinguish between information and non-information resources
– http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist vs.
http://dbpedia.org/resource/ABBA
– Data publishers on the Semantic Web use Linked Data principles:
  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names
  • When someone looks up a URI, provide useful information, using standards (RDF, SPARQL)
  • Include links to other URIs, so that they can discover more things
• eXtensible Markup Language (XML) used for data exchange
– Used on the Semantic Web to exchange the description of resources
– E.g. format that can be transformed into RDF and imported into the repository
23
RDF, RDFS and OWL
• Resource Description Framework (RDF)
– Is the HTML of the Semantic Web
– Simple way to describe resources on the Web
– Based on triples <subject, predicate, object>
– Various serializations, including one based on XML
– A simple ontology language (RDFS)
– E.g. language used to store the data in the repository
– More in lecture 3
• Web Ontology Language (OWL)
– Is a more complex ontology language than RDFS
– Layered language based on Description Logics
– Overcomes some RDF(S) limitations
– E.g. ontology language used to define the schemas used in repository
– More in lecture 7
24
RDF Graph Encoding a Description of ABBA
25
RDF Serialized in RDF/XML
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY bbca "http://www.bbc.co.uk/music/artists/">
  <!ENTITY bbci "http://www.bbc.co.uk/music/images/artists/">
  <!ENTITY mba "http://musicbrainz.org/artist/">
]>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:mo="http://purl.org/ontology/mo/">
  <mo:MusicArtist rdf:about="http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist">
    <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
    <foaf:name>ABBA</foaf:name>
    <foaf:homepage rdf:resource="http://www.abbasite.com/"/>
    <mo:image rdf:resource="&bbci;542x305/d87e52c5-bb8d-4da8-b941-9f4928627dc8.jpg"/>
    <mo:member rdf:resource="&bbca;042c35d3-0756-4804-b2c2-be57a683efa2#artist"/>
    <mo:member rdf:resource="&bbca;2f031686-3f01-4f33-a4fc-fb3944532efa#artist"/>
    <mo:member rdf:resource="&bbca;aebbb417-0d18-4fec-a2e2-ce9663d1fa7e#artist"/>
    <mo:member rdf:resource="&bbca;ffb77292-9712-4d03-94aa-bdb1d4771d38#artist"/>
    <mo:musicbrainz rdf:resource="&mba;d87e52c5-bb8d-4da8-b941-9f4928627dc8.html"/>
    <mo:wikipedia rdf:resource="http://en.wikipedia.org/wiki/ABBA"/>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/ABBA"/>
  </mo:MusicArtist>
</rdf:RDF>
26
RDF Serialized in Turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mo:   <http://purl.org/ontology/mo/> .

<http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist>
    rdf:type mo:MusicArtist, mo:MusicGroup ;
    foaf:name "ABBA" ;
    foaf:homepage <http://www.abbasite.com/> ;
    mo:image <http://www.bbc.co.uk/music/images/artists/542x305/d87e52c5-bb8d-4da8-b941-9f4928627dc8.jpg> ;
    mo:member <http://www.bbc.co.uk/music/artists/042c35d3-0756-4804-b2c2-be57a683efa2#artist> ,
              <http://www.bbc.co.uk/music/artists/2f031686-3f01-4f33-a4fc-fb3944532efa#artist> ,
              <http://www.bbc.co.uk/music/artists/aebbb417-0d18-4fec-a2e2-ce9663d1fa7e#artist> ,
              <http://www.bbc.co.uk/music/artists/ffb77292-9712-4d03-94aa-bdb1d4771d38#artist> ;
    mo:musicbrainz <http://musicbrainz.org/artist/d87e52c5-bb8d-4da8-b941-9f4928627dc8.html> ;
    mo:wikipedia <http://en.wikipedia.org/wiki/ABBA> ;
    owl:sameAs <http://dbpedia.org/resource/ABBA> .
27
RDFS and OWL Example
• Reasoning example in RDFS
– rdfs:subClassOf can model class hierarchies
– mo:MusicGroup and mo:MusicArtist specify two classes
– Axiom <mo:MusicGroup, rdfs:subClassOf, mo:MusicArtist>
– Stating that ABBA is an instance of type MusicGroup enables reasoners to
conclude that ABBA is also an instance of type MusicArtist
– When a query asks for all MusicArtists, ABBA will be contained in the query
result, even though there is no explicit assertion of this
• Reasoning example in OWL
– owl:sameAs can be used to specify that two resources are identical
– To consolidate information about ABBA from multiple sources we can specify that
http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist
and
http://dbpedia.org/resource/ABBA
are the same
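The two reasoning examples can be sketched over a plain set of triples. This is a simplified illustration in Python, not a full RDFS or OWL reasoner: `infer_types` applies only the subclass rule, `describe` follows owl:sameAs for a single hop, and the prefixed names `bbc:abba`, `dbpedia:ABBA` and the property `ex:note` are shorthand assumed for the example.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SAME = "owl:sameAs"


def infer_types(triples):
    """Add (x rdf:type D) whenever (x rdf:type C) and (C rdfs:subClassOf D)."""
    kb = set(triples)
    while True:
        new = set()
        for s, p, o in kb:
            if p != TYPE:
                continue
            for c, q, d in kb:
                if q == SUBCLASS and c == o:
                    new.add((s, TYPE, d))
        if new <= kb:          # nothing new: fixpoint reached
            return kb
        kb |= new


def describe(kb, resource):
    """Collect (predicate, object) pairs of a resource and its direct aliases."""
    aliases = {resource}
    for s, p, o in kb:
        if p == SAME and s == resource:
            aliases.add(o)
        if p == SAME and o == resource:
            aliases.add(s)
    return {(p, o) for s, p, o in kb if s in aliases and p != SAME}


kb = infer_types({
    ("mo:MusicGroup", SUBCLASS, "mo:MusicArtist"),
    ("bbc:abba", TYPE, "mo:MusicGroup"),
    ("bbc:abba", SAME, "dbpedia:ABBA"),
    ("dbpedia:ABBA", "ex:note", "described at DBpedia"),  # hypothetical statement
})

# A query for all MusicArtists now returns ABBA without an explicit assertion.
artists = sorted(s for s, p, o in kb if p == TYPE and o == "mo:MusicArtist")
print(artists)
print(sorted(describe(kb, "bbc:abba")))
```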
28
SPARQL and Rule Languages
• SPARQL
– Query language for RDF triples
– A protocol for querying RDF data over the Web
– E.g. language used to query the repository from the user interface
– Can also be used for Updates
– More in lecture 6
• Rule languages (esp. Rule Interchange Format RIF)
– W3C recommendation for exchanging rule sets between rule engines
– Extend ontology languages with proprietary axioms
– Based on different types of logics
  • Description Logic
  • Logic Programming
– E.g. used to enable reasoning over data to infer new knowledge
– More in lecture 8
29
SPARQL Example
• SPARQL query for other music groups that members of ABBA sing in
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo:   <http://purl.org/ontology/mo/>
SELECT ?memberName ?groupName
WHERE {
  <http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist> mo:member ?m .
  ?x mo:member ?m .
  ?x rdf:type mo:MusicGroup .
  ?m foaf:name ?memberName .
  ?x foaf:name ?groupName .
  FILTER (?groupName != "ABBA")
}
30
SPARQL Example
• SPARQL query for other music groups that members of ABBA sing in
• Graphical representation of WHERE clause
<http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist> mo:member ?m .
?x mo:member ?m .
?x rdf:type mo:MusicGroup .
?m foaf:name ?memberName .
?x foaf:name ?groupName
31
Two RIF rules for mapping FOAF predicates
• If the statements in a rule's antecedent are true, the statements in its conclusion are true as well
if { ?x foaf:firstName ?first;
        foaf:surname ?last }
then { ?x foaf:family_name ?last;
          foaf:givenname ?first;
          foaf:name func:string-join(?first " " ?last) }

if { ?x foaf:name ?name } and
   pred:contains(?name, " ")
then { ?x foaf:firstName func:substring-before(?name, " ");
          foaf:surname func:substring-after(?name, " ") }
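To make the effect of the two mapping rules concrete, here is a rough Python sketch that applies them to a single person description. It is not RIF and does not use a rule engine. The dictionary representation (one value per property) and the function name `apply_foaf_rules` are assumptions for illustration, and the join in the first rule is read as "first name, space, last name", which is what the slide appears to intend.

```python
def apply_foaf_rules(person):
    """Return a copy of a FOAF-like description with both rules applied once."""
    out = dict(person)

    # Rule 1: firstName + surname  =>  family_name, givenname, name
    if "foaf:firstName" in out and "foaf:surname" in out:
        first, last = out["foaf:firstName"], out["foaf:surname"]
        out.setdefault("foaf:family_name", last)
        out.setdefault("foaf:givenname", first)
        out.setdefault("foaf:name", f"{first} {last}")

    # Rule 2: a name containing a space  =>  firstName, surname
    name = out.get("foaf:name", "")
    if " " in name:
        before, _, after = name.partition(" ")  # split at the first space
        out.setdefault("foaf:firstName", before)
        out.setdefault("foaf:surname", after)

    return out


from_parts = apply_foaf_rules(
    {"foaf:firstName": "Benny", "foaf:surname": "Andersson"}
)
from_name = apply_foaf_rules({"foaf:name": "Benny Andersson"})
print(from_parts["foaf:name"])
print(from_name["foaf:firstName"], "/", from_name["foaf:surname"])
```

A real rule engine would add the derived statements as further triples rather than fill in dictionary slots; `setdefault` is used here only to avoid overwriting values that are already present.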
32
Logics, Proof and Trust
• Security and Encryption
– HTTPS provides data integrity and confidentiality when transmitting data and queries
– Digital signing of RDF graphs provides authenticity and non-repudiation
• Unifying logic
– Bring together the various ontology and rule languages
– Connect unlinked data to provide more meaning to data, and drive data integration
– E.g. identity management and alignment via http://sameas.org
• Proof
– Explanation of inference results, data provenance
• Trust
– Trust that the system performs correctly
– Trust that the system can explain what it is doing
– Network of trust for data sources and services
– Technology and user interface
• Many open problems, topics for future research
33
Foundations
[Figure: Semantic Web Stack, with RIF as the rules layer]
34
More than a-z, A-Z
UNICODE
35
Character Sets
• ASCII – 7 bit, 128 characters (a-z, A-Z, 0-9, punctuation)
• Extension code pages – 128 chars (ß, Ä, ñ, ø, Š, etc.)
– Different systems, many different code pages
– ISO Latin 1, CP1252 – Western languages (197 = Å)
– ISO Latin 2, CP1250 – East Europe (197 = Ĺ)
• Code page is an interpretation, not a property of text
– Swedish programmer would have to write
ä aÄiÜ='Ön'; ü instead of
{ a[i]='\n'; }
• If the code page is interpreted incorrectly, the displayed
text will not be what was intended
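The point that a code page is an interpretation can be checked in a few lines of Python. The sketch below decodes the single byte 197 under the two code-page families named above, using the codec names shipped with Python's standard library.

```python
raw = bytes([197])  # one byte, value 197 (0xC5)

# The same byte read under two different code pages.
as_latin1 = raw.decode("iso8859_1")   # ISO Latin 1
as_latin2 = raw.decode("iso8859_2")   # ISO Latin 2
as_cp1252 = raw.decode("cp1252")      # Windows, Western languages
as_cp1250 = raw.decode("cp1250")      # Windows, East Europe

print(as_latin1, as_latin2, as_cp1252, as_cp1250)
```

The byte never changes; only the table used to look it up does, which is why text without a declared encoding is ambiguous.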
36
UNICODE: an unambiguous code
• We need a solution that can be interpreted unambiguously, i.e. one in which
each code corresponds to exactly one character and vice versa
• That’s why UNICODE was created!
$ U+0024   Å U+00C5   Ĺ U+0139   Æ U+00C6   ή U+03AE
ك U+0643   ⅝ U+215D   ♥ U+2665   Ж U+0416   ญ U+0E0D
37
UNICODE
• ISO standard
– About 100,000 characters, space for 1,000,000
– Unique code points from U+0000 through U+FFFF to U+10FFFF
– Well-defined process for adding characters
• When dealing with any text, simply use UNICODE
– Character code charts: http://www.unicode.org/charts/
• See also:
– http://www.tbray.org/talks/rubyconf2006.pdf
– http://tbray.org/ongoing/When/200x/2003/04/06/Unicode
38
How to identify things on the Web
URI: UNIFORM
RESOURCE IDENTIFIERS
39
Identifier, Resource, Representation
Taken from http://www.w3.org/TR/webarch/
40
URI, URN, URL
• A Uniform Resource Identifier (URI) is a string of characters used to identify
a name or a resource on the Internet
• A URI can be a URL or a URN
• A Uniform Resource Name (URN) defines an item's identity
– the URN urn:isbn:0-395-36341-1 is a URI that specifies the identifier system, i.e.
International Standard Book Number (ISBN), as well as the unique reference within that
system and allows one to talk about a book, but doesn't suggest where and how to obtain an
actual copy of it
• A Uniform Resource Locator (URL) provides a method for finding it
– the URL http://www.auckland.ac.nz identifies a resource (UoA's home page) and implies that
a representation of that resource (such as the home page's current HTML code, as encoded
characters) is obtainable via HTTP from a network host named www.auckland.ac.nz
41
URI Syntax
• Examples
– http://www.ietf.org/rfc/rfc3986.txt
– mailto:John.Doe@example.com
– news:comp.infosystems.www.servers.unix
– telnet://melvyl.ucop.edu/
• URI Syntax
scheme: [//authority] [/path] [?query] [#fragid]
– The scheme distinguishes different kinds of URIs
– Authority normally identifies a server
– Path normally identifies a directory and a file
– Query adds extra parameters
– Fragment ID identifies a secondary resource
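The five components can be pulled apart with Python's standard `urllib.parse.urlsplit`. The first URL below is a made-up example (not from the slides) chosen so that every component is present; the second is the mailto example from the list above.

```python
from urllib.parse import urlsplit

# Hypothetical URL containing all five components.
parts = urlsplit("http://www.example.org/rfc/doc.txt?lang=en#section-2")
print(parts.scheme)    # kind of URI
print(parts.netloc)    # authority
print(parts.path)      # path
print(parts.query)     # query
print(parts.fragment)  # fragment identifier

# A URI without "//authority": everything after the scheme is the path.
mail = urlsplit("mailto:John.Doe@example.com")
print(mail.scheme, mail.netloc == "", mail.path)
```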
42
URI Syntax cont’d
•
Reserved characters (like /:?#@$&+* )
•
Many allowed characters
•
Rest percent-encoded by UTF-8
– http://google.com/search?q=technikerstra%C3%9Fe
•
IRI – Internationalized Resource Identifier
– Allows whole UNICODE
– Specifies transformation into URI – mostly UTF-8 encoding
43
URI Schemes
• Schemes partition the URI space into subspaces
• Schemes can add or clarify properties of resources
– Ownership (how authorities are formed)
– Persistence (how stable the URIs should be)
– Protocol (default access protocol)

Scheme  Description                             RFC
file    Host-specific file names                [1738]
ftp     File Transfer Protocol                  [1738]
http    Hypertext Transfer Protocol             [2616]
https   Hypertext Transfer Protocol Secure      [2818]
im      Instant Messaging                       [3860]
imap    internet message access protocol        [5092]
ipp     Internet Printing Protocol              [3510]
iris    Internet Registry Information Service   [3981]
ldap    Lightweight Directory Access Protocol   [4516]
mailto  Electronic mail address                 [2368]
mid     message identifier                      [2392]
From http://www.iana.org/assignments/uri-schemes.html
44
How to exchange structured data on the Web
XML: EXTENSIBLE
MARKUP LANGUAGE
45
eXtensible Markup Language
• Language for creating languages
– “Meta-language”
– XHTML is a language: HTML expressed in XML
• W3C Recommendation (standard)
– XML is, for the information industry,
what the container is for international shipping
– For structured and semistructured data
• Main plus: wide support, interoperability
– Platform-independent
• Applying new tools to old data
46
Structure of XML Documents
• Elements, attributes, content
• One root element in document
• Characters, child elements in content
47
XML Element
• Syntax <name>contents</name>
– <name> is called the opening tag
– </name> is called the closing tag
• Examples
<gender>Female</gender>
<story>Once upon a time there was…. </story>
• Element names case-sensitive
48
Attributes to XML Elements
• Name/value pairs, part of element contents
• Syntax
<name attribute_name="attribute_value">contents</name>
• Values surrounded by single or double quotes
• Example
<temperature unit="F">64</temperature>
<swearword language='fr'>con</swearword>
49
Empty Elements
• Empty element: <name></name>
• This can be shortened: <name/>
• Empty elements may have attributes
• Example
<grade value='A'/>
50
Comments
• May occur anywhere in element contents or outside the
root element
• Start with <!--
• End with -->
• May not contain a double hyphen
• Comments cannot be nested
• Example:
<element>content
<!-- a comment, will be ignored in processing -->
</element>
<!-- comment outside the root element -->
51
Nesting Elements
• Elements may contain other (child) elements
– The containing element is the parent element
• Elements must be properly nested
• Example with improper nesting:
<b>bold <i>bold-italic</b> italic?</i>
• The above is not XML (not well-formed)
52
Special Characters in XML
• < and > are obviously reserved in content
– Written as &lt; and &gt;
• Same for ' and " in attribute values
– Written as &apos; and &quot;
• Now & is also reserved
– Written as &amp;
• Any character: &#223; or &#xDF; for ß
– Decimal or hexadecimal Unicode code point
• Elements and attributes whose name starts with “xml”
are also special
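A short Python check of these escapes, using only the standard library: `xml.sax.saxutils.escape` produces the entity references for text content, and `xml.etree.ElementTree` resolves entity and character references when parsing. The sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Writing: reserved characters in text content become entity references.
escaped = escape("1 < 2 & 3 > 2")
print(escaped)

# Reading: the parser turns references back into characters.
# &#223; (decimal) and &#xDF; (hexadecimal) both denote U+00DF.
elem = ET.fromstring("<t>&#223; &#xDF; &lt;&amp;&gt;</t>")
print(elem.text)
```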
53
Uses of XML
• Document mark-up – XHTML
– HTML is a language, so it can be expressed in XML
• Exchanged data
– Scalable vector graphics – SVG
– E-commerce – ebXML
– Messaging in general – SOAP
– And many more standards
• Internal data
– Databases
– Configuration files
• Etc.
54
Why XML?
• For semistructured data:
– Loose but constrained structure
– Unspecified content length
• For structured data:
– Table(s) or similar rows
– Well-defined structure, data types
– Good interoperability
– But: requirements for quick access, processing
55
XML Parsers
• Document Object Model (DOM) builder
– Creates an object model of XML document, tree-traversal API
– In-memory representation, random access
– DOM complex, simpler JDOM etc.
• Simple API for XML parsing (SAX)
– Views XML as stream of events
– el_start("date"), attribute("day", "10"), el_end("date")
– Content reported as callbacks to methods on a user-defined handler object
– DOM builder can use SAX
• Pull parsers
– Intermediate parsed results can be accessed as local variables
– StAX (JAVA), XMLReader (PHP), System.XML.XMLReader (.NET)
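The event view of SAX can be sketched with Python's standard `xml.sax` module. The handler class `EventLogger` and the event names `el_start`, `attribute` and `el_end` are taken from the slide's notation for illustration; real SAX delivers attributes together with the start-element callback, so the handler below splits them into separate entries itself.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records parser callbacks as a flat list of events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("el_start", name))
        # SAX passes all attributes with the start tag.
        for key, value in attrs.items():
            self.events.append(("attribute", key, value))

    def endElement(self, name):
        self.events.append(("el_end", name))


handler = EventLogger()
xml.sax.parseString(b'<date day="10"/>', handler)
print(handler.events)
```

Nothing is kept in memory beyond what the handler chooses to store, which is the main contrast with a DOM builder.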
56
How to distinguish categories of resources
NAMESPACES
57
The Problem
• Documents use different vocabularies
– Example 1: CD music collection
– Example 2: online order transaction
• Merging multiple documents together
– Name collisions can occur
• Example 1: albums have a <name>
• Example 2: customers have a <name>
– How do you differentiate between the two?
58
The Solution: Namespaces!
• What is a namespace?
– A syntactic way to differentiate similar names in an XML document
• Binding namespaces
– Uses Uniform Resource Identifier (URI)
• e.g. “http://example.com/NS”
– Can bind to a named or “default” prefix
59
Namespace Binding Syntax
• Use “xmlns” attribute
– Named prefix
• <a:foo xmlns:a='http://example.com/NS'/>
– Default prefix
• <foo xmlns='http://example.com/NS'/>
• Element and attribute names are “qualified”
– URI, local part (or “local name”) pair
• e.g. { “http://example.com/NS” , “foo” }
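The qualified name as a (URI, local name) pair is visible in Python's `xml.etree.ElementTree`, which writes it as `{URI}local`. The sketch parses the two bindings shown above plus an unbound element for comparison.

```python
import xml.etree.ElementTree as ET

named = ET.fromstring("<a:foo xmlns:a='http://example.com/NS'/>")
default = ET.fromstring("<foo xmlns='http://example.com/NS'/>")
unbound = ET.fromstring("<foo/>")

# The prefix is only syntax: both bound elements have the same qualified name.
print(named.tag)
print(default.tag)
print(unbound.tag)
```

Because the prefix disappears after parsing, two documents may use different prefixes for the same namespace and still be processed identically.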
60
Example Document I
•
Namespace binding
<?xml version='1.0' encoding='UTF-8'?>
<order>
  <item code='BK123'>
    <name>Care and Feeding of Wombats</name>
    <desc xmlns:html='http://www.w3.org/1999/xhtml'>
      The <html:b>best</html:b> book ever written!
    </desc>
  </item>
</order>
61
Example Document II
•
Namespace scope
<?xml version='1.0' encoding='UTF-8'?>
<order>
  <item code='BK123'>
    <name>Care and Feeding of Wombats</name>
    <desc xmlns:html='http://www.w3.org/1999/xhtml'>
      The <html:b>best</html:b> book ever written!
    </desc>
  </item>
</order>
62
Example Document III
•
Bound elements
<?xml version='1.0' encoding='UTF-8'?>
<order>
  <item code='BK123'>
    <name>Care and Feeding of Wombats</name>
    <desc xmlns:html='http://www.w3.org/1999/xhtml'>
      The <html:b>best</html:b> book ever written!
    </desc>
  </item>
</order>
63
How to define XML document structures
XML SCHEMA
64
What is it?
• A grammar definition language
– More expressive than Document Type Definitions (DTDs)
• Uses XML syntax
– Defined by W3C
• Primary features
– Datatypes
• e.g. integer, float, date, etc…
– More powerful content models
• e.g. namespace-aware, type derivation, etc…
65
XML Schema Types
• Simple types
– Basic datatypes
– Can be used for attributes and element text
– Extendable
• Complex types
– Defines structure of elements
– Extendable
• Types can be named or “anonymous”
66
Simple Types
• DTD datatypes
– Strings, ID/IDREF, NMTOKEN, etc…
• Numbers
– Integer, long, float, double, etc…
• Other
– Binary (base64, hex)
– QName, URI, date/time
– etc…
67
Deriving Simple Types
• Apply facets
– Specify enumerated values
– Add restrictions to data
– Restrict lexical space
• Allowed length, pattern, etc…
– Restrict value space
• Minimum/maximum values, etc…
• Extend by list or union
68
A Simple Type Example
• Integer with value (1234, 5678]
<xsd:simpleType name='MyInteger'>
  <xsd:restriction base='xsd:integer'>
    <xsd:minExclusive value='1234'/>
    <xsd:maxInclusive value='5678'/>
  </xsd:restriction>
</xsd:simpleType>
69
A Simple Type Example II
• Validating integer with value (1234, 5678]
<data xsi:type='MyInteger'></data>        INVALID
<data xsi:type='MyInteger'>Andy</data>    INVALID
<data xsi:type='MyInteger'>-32</data>     INVALID
<data xsi:type='MyInteger'>1233</data>    INVALID
<data xsi:type='MyInteger'>1234</data>    INVALID
<data xsi:type='MyInteger'>1235</data>
<data xsi:type='MyInteger'>5678</data>
<data xsi:type='MyInteger'>5679</data>    INVALID
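The outcome for each sample follows from the two facets: the lexical form must be an integer, and the value must lie in (1234, 5678]. The Python sketch below mimics that check by hand; it is not an XML Schema validator, and the function name `is_valid_my_integer` is an assumption for the example.

```python
import re


def is_valid_my_integer(lexical):
    """Mimic the MyInteger type: xsd:integer restricted to (1234, 5678]."""
    text = lexical.strip()
    # Lexical space of xsd:integer: optional sign followed by digits.
    if not re.fullmatch(r"[+-]?[0-9]+", text):
        return False
    value = int(text)
    # minExclusive = 1234, maxInclusive = 5678
    return 1234 < value <= 5678


samples = ["", "Andy", "-32", "1233", "1234", "1235", "5678", "5679"]
results = {s: is_valid_my_integer(s) for s in samples}
for sample, ok in results.items():
    print(repr(sample), "valid" if ok else "INVALID")
```

Only 1235 and 5678 pass: 1234 fails because the lower bound is exclusive, while 5678 passes because the upper bound is inclusive.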
70
Complex Types
• Element content models
– Simple
– Mixed
• Unlike DTDs, elements in mixed content can be ordered
– Sequences and choices
• Can contain nested sequences and choices
– All
• All elements required but order is not important
71
A Complex Type Example I
• Mixed content that allows <b>, <i>, and <u>
<xsd:complexType name='RichText' mixed='true'>
  <xsd:choice minOccurs='0' maxOccurs='unbounded'>
    <xsd:element name='b' type='RichText'/>
    <xsd:element name='i' type='RichText'/>
    <xsd:element name='u' type='RichText'/>
  </xsd:choice>
</xsd:complexType>
72
A Complex Type Example II
• Validation of RichText
<content xsi:type='RichText'></content>
<content xsi:type='RichText'>Andy</content>
<content xsi:type='RichText'>XML is <i>awesome</i>.</content>
<content xsi:type='RichText'><B>bold</B></content>    INVALID
<content xsi:type='RichText'><foo/></content>         INVALID
73
EXTENSIONS
74
Building On The Foundations
• RDF for semantic data
– Graphs of linked data
– Semantic Web
• Any XML or HTML can support translation to RDF
– GRDDL: a pointer to a transformation
– RDFa: RDF in XHTML
– Makes existing data part of the Semantic Web
• XML has encryption and digital signature
– Necessary technologies for data provenance, trust
75
Web of Linked Data
76
RDF Teaser
• Resource Description Framework
– Metadata: about Web resources
– But also any other data
• Graphs of resources interlinked with properties
[Figure: example graph]
Rafael Nadal  plays  Tennis
Rafael Nadal  knows  Shakira
Shakira  sings  Waka waka
• Ontology languages for data schemas
– Various properties: knows, plays, sings
– Classes of resources: Person, Athlete, Singer, Sport, Song
• SPARQL for querying the data
77
The Semantic Web architecture in practice
ILLUSTRATION BY
A LARGER EXAMPLE
78
Semantic Conference
• All the data about the conference is part of the Semantic Web
– Date, location
– Organizers, peer-review committees
– Articles (papers), their authors
– Detailed program schedule
• Each Semantic Web architecture layer plays a role
• ISWC is annotating conference data using Semantic Web technologies
– http://data.semanticweb.org/conference/iswc/2011/html
– Currently available data covers only papers and authors
– This could be extended to support features discussed above
79
Foundation Layers
•
UNICODE
– All participants' names should be in UNICODE because they are international: Denny
Vrandečić, Diego Meroño, François Maué
– Same for paper titles: "α-decay and β-decay of heavy atoms"
•
URI: All important things must have identifiers, for example:
– Conference: http://data.semanticweb.org/conference/iswc/2011
– Participant: http://data.semanticweb.org/person/piero-bonatti
– Participant's affiliation: http://data.semanticweb.org/organization/talis-informationlimited
– Paper: http://data.semanticweb.org/conference/iswc/2011/paper/tutorial/7
80
Data Layers
•
XML
– The HTML pages should be in XHTML
– The RDF data (below) should be in RDF/XML
– News feed should be in Atom (an XML format)
•
RDF
– The conference dataset, and any useful subsets, should be published in RDF for
download; for example:
•
http://data.semanticweb.org/conference/iswc/2011/rdf
81
Ontologies, Query
• RDFS, OWL
– The conference would use various vocabularies and ontologies, such as:
– FOAF (Friend of a friend) for talking about the attendees and authors/presenters
– Dublin Core for paper metadata
– Calendar ontology for the program
• SPARQL
– The conference server should have a public SPARQL endpoint that can be used for
queries over the conference data
– http://data.semanticweb.org/snorql/
82
Browsing ISWC Data
http://data.semanticweb.org/person/tom-heath/html
83
Querying ISWC Data
http://data.semanticweb.org/snorql/
84
That’s almost all for today…
SUMMARY
85
Things to Keep in Mind
• Semantic Web builds on the Web
• For any text, use UNICODE, probably UTF-8
• URIs can identify anything
– Not only documents on the Web
• XML helps with data exchange, interoperability
• XML languages are distinguished with namespaces
86
References
• Mandatory:
– http://www.w3.org/TR/webarch/
– http://www.w3.org/DesignIssues/Architecture.html
• Further reading:
– http://www.w3.org/Provider/Style/URI
– http://www.ietf.org/rfc/rfc3986.txt
– http://www.unicode.org/charts/
– http://www.tbray.org/talks/rubyconf2006.pdf
– http://tbray.org/ongoing/When/200x/2003/04/06/Unicode
– http://www.w3.org/TR/xml/
– http://www.w3.org/TR/xml-names/
– http://www.w3.org/TR/xmlschema-1/
– Fensel et al.: On2broker: Semantic-Based Access to Information Sources at the WWW
– Fensel et al.: Ontobroker in a Nutshell
– http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
– http://www.w3.org/DesignIssues/Semantic.html
87
References
• Wikipedia links:
– http://en.wikipedia.org/wiki/Semantic_Web_Stack
– http://en.wikipedia.org/wiki/URI
– http://en.wikipedia.org/wiki/Unicode
– http://en.wikipedia.org/wiki/XML
– http://en.wikipedia.org/wiki/XML_Namespaces
– http://en.wikipedia.org/wiki/Resource_Description_Framework
– http://en.wikipedia.org/wiki/RDF_Schema
– http://en.wikipedia.org/wiki/Web_Ontology_Language
– http://en.wikipedia.org/wiki/SPARQL
– http://en.wikipedia.org/wiki/Rule_Interchange_Format
88
Next Lecture
#  Title
1  Introduction
2  Semantic Web Architecture
3  Resource Description Framework (RDF)
4  Web of Data
5  Generating Semantic Annotations
6  Storage and Querying
7  Web Ontology Language (OWL)
8  Rule Interchange Format (RIF)
89
Questions?
90