Creating and Exploiting a Web of Semantic Data

Overview
•Introduction
•Semantic Web 101
•Recent Semantic Web trends
•Examples: DBpedia, Wikitology
•Conclusion
The Age of Big Data
•Massive amounts of data are available today
•Advances in many fields driven by availability of
unstructured data, e.g., text, audio, images
•Increasingly, large amounts of structured and
semi-structured data are also online
•Much of this is available in the Semantic Web
language RDF, fostering integration and
interoperability
•Such structured data is especially important for
the sciences
Twenty years ago…
Tim Berners-Lee’s 1989 WWW proposal described a web of
relationships among named objects, unifying many
information management tasks
http://www.w3.org/History/1989/proposal.html
Capsule history
• Guha’s MCF (~94)
• XML+MCF=>RDF (~96)
• RDF+OO=>RDFS (~99)
• RDFS+KR=>DAML+OIL (00)
• W3C’s SW activity (01)
• W3C’s OWL (03)
• SPARQL, RDFa (08)
• Rules (09)
Ten years ago ….
•The W3C started
developing standards
for the Semantic Web
•The vision, technology
and use cases are still
evolving
•Moving from a web of
documents to a web of
data
Today
4.5 billion integrated facts
published on the Web as
RDF Linked Open Data
Tomorrow
Large collections of
integrated facts published
on the Web for many
disciplines and domains
W3C’s Semantic Web Goal
“The Semantic Web is an extension of
the current web in which information
is given well-defined meaning, better
enabling computers and people to
work in cooperation.”
-- Berners-Lee, Hendler and Lassila, The
Semantic Web, Scientific American, 2001
Contrast with a non-Web approach
The W3C Semantic Web approach is
•Distributed
•Open
•Non-proprietary
•Standards based
How can we share data on the Web?
•POX, Plain Old XML, is one approach, but it has
deficiencies
•The Semantic Web languages RDF and OWL
offer a simpler and more abstract data model
(a graph) that is better for integration
•Its well defined semantics supports knowledge
modeling and inference
•Supported by a stable, funded standards
organization, the World Wide Web Consortium
Simple RDF Example
[Figure: RDF graph]
http://umbc.edu/~finin/talks/idm02/
  dc:Title → “Intelligent Information Systems on the Web and in the Aether”
  dc:Creator → (note: “blank node”)
    bib:Aff → http://umbc.edu/
    bib:name → “Tim Finin”
    bib:email → “finin@umbc.edu”
The RDF Data Model
•An RDF document is an unordered collection of
statements, each with a subject, predicate and
object
•Such triples can be thought of as a labelled arc
in a graph
•Statements describe properties of resources
•A resource is any object that can be referenced
or denoted by a URI
•Properties themselves are also resources (URIs)
•Dereferencing a URI produces useful additional
information, e.g., a definition or additional facts
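The data model above can be sketched in a few lines of plain Python. This is only an illustration, not an RDF library: the graph is a set of (subject, predicate, object) tuples built from the talk's running example, and the label `_:creator` for the blank node and the helper name `objects` are assumptions made for this sketch.

```python
# A toy in-memory RDF graph: an unordered set of (subject, predicate, object) triples.
# URIs come from the talk's running example; BLANK is a made-up label for the
# unnamed creator node.
DC = "http://purl.org/dc/elements/1.1/"
BIB = "http://daml.umbc.edu/ontologies/bib/"
TALK = "http://umbc.edu/~finin/talks/idm02/"
BLANK = "_:creator"

graph = {
    (TALK, DC + "title", "Intelligent Information Systems on the Web and in the Aether"),
    (TALK, DC + "creator", BLANK),
    (BLANK, BIB + "name", "Tim Finin"),
    (BLANK, BIB + "email", "finin@umbc.edu"),
    (BLANK, BIB + "Aff", "http://umbc.edu/"),
}

def objects(graph, subject, predicate):
    """Follow every arc labelled `predicate` that leaves `subject`."""
    return {o for s, p, o in graph if s == subject and p == predicate}

# Walk two arcs: talk --dc:creator--> blank node --bib:name--> literal
creator = next(iter(objects(graph, TALK, DC + "creator")))
creator_names = objects(graph, creator, BIB + "name")
```

Because the graph is a set, statement order carries no meaning and duplicate statements collapse, which matches the "unordered collection" view of an RDF document.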
RDF is the first SW language
RDF is a simple language for graph-based representations
The RDF data model has three forms:
•Graph: good for human viewing
•XML encoding: good for machine processing
<rdf:RDF ……..>
<….>
<….>
</rdf:RDF>
•Triples: good for storage and reasoning
stmt(docInst, rdf_type, Document)
stmt(personInst, rdf_type, Person)
stmt(inroomInst, rdf_type, InRoom)
stmt(personInst, holding, docInst)
stmt(inroomInst, person, personInst)
XML encoding for RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:bib="http://daml.umbc.edu/ontologies/bib/">
  <rdf:Description rdf:about="http://umbc.edu/~finin/talks/idm02/">
    <dc:title>Intelligent Information … and in the Aether</dc:title>
    <dc:creator>
      <rdf:Description>
        <bib:Name>Tim Finin</bib:Name>
        <bib:Email>finin@umbc.edu</bib:Email>
        <bib:Aff rdf:resource="http://umbc.edu/" />
      </rdf:Description>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>
N3 is a friendlier encoding
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix bib: <http://daml.umbc.edu/ontologies/bib/> .
<http://umbc.edu/~finin/talks/idm02/>
  dc:title "Intelligent ... and in the Aether" ;
  dc:creator
    [ bib:Name "Tim Finin" ;
      bib:Email "finin@umbc.edu" ;
      bib:Aff <http://umbc.edu/> ] .
RDFS supports simple inferences
• RDF Schema adds vocabulary for classes, properties & constraints
• An RDF ontology plus some RDF statements may imply additional
RDF statements (not possible in XML)
• Note that this is part of the data model and not of the accessing or
processing code.
Asserted statements:
@prefix rdfs: <http://www.....>.
@prefix : <genesis.n3>.
parent a rdf:Property;
rdfs:domain person;
rdfs:range person.
mother rdfs:subPropertyOf parent;
rdfs:domain woman;
rdfs:range person.
eve mother cain.
Implied statements:
person a class.
woman subClass person.
mother a property.
eve a person;
a woman;
parent cain.
cain a person.
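How the instance-level conclusions about eve and cain follow from the schema can be sketched with a small forward-chaining loop. This is a minimal illustration that assumes only three RDFS entailment patterns (subPropertyOf, domain, range); a real RDFS reasoner implements more rules, and the function name `rdfs_closure` is invented for this sketch.

```python
# Forward chaining over three RDFS entailment patterns:
#   subPropertyOf: (s p o), (p subPropertyOf q)  =>  (s q o)
#   domain:        (s p o), (p domain C)         =>  (s type C)
#   range:         (s p o), (p range C)          =>  (o type C)
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBPROP = "rdfs:subPropertyOf"

asserted = {
    ("parent", DOMAIN, "person"),
    ("parent", RANGE, "person"),
    ("mother", SUBPROP, "parent"),
    ("mother", DOMAIN, "woman"),
    ("mother", RANGE, "person"),
    ("eve", "mother", "cain"),
}

def rdfs_closure(triples):
    """Apply the three rules repeatedly until no new triple appears."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            # Look for schema statements whose subject is this triple's predicate.
            for prop, rel, target in closure:
                if prop != p:
                    continue
                if rel == SUBPROP:
                    new.add((s, target, o))
                elif rel == DOMAIN:
                    new.add((s, TYPE, target))
                elif rel == RANGE:
                    new.add((o, TYPE, target))
        if new <= closure:
            return closure
        closure |= new

inferred = rdfs_closure(asserted) - asserted
```

The loop needs two passes here: the first derives `eve parent cain` from the subproperty statement, and the second uses that new triple with the domain of `parent` to derive that eve is a person.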
OWL adds further richness
OWL adds richer representational vocabulary, e.g.
– parentOf is the inverse of childOf
– Every person has exactly one mother
– Every person is a man or a woman but not both
– A man is the equivalent of a person with a sex
property with value “male”
OWL is based on ‘description logic’ – a logic subset
with efficient reasoners that are complete
– Good algorithms for reasoning about descriptions
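The first example, an inverse property, is simple enough to sketch directly. The snippet below is a toy illustration over (subject, predicate, object) tuples, not a description-logic reasoner; the function name `with_inverses` and the sample fact are assumptions made for this sketch.

```python
def with_inverses(triples, prop, inverse):
    """Return the triples plus those implied by `prop` being the inverse of `inverse`."""
    implied = set()
    for s, p, o in triples:
        if p == prop:
            implied.add((o, inverse, s))
        elif p == inverse:
            implied.add((o, prop, s))
    return set(triples) | implied

# Hypothetical fact, reusing the names from the RDFS example.
facts = {("eve", "parentOf", "cain")}
closed = with_inverses(facts, "parentOf", "childOf")
```

The other examples (cardinality, disjoint unions, class equivalence) need genuine description-logic reasoning and are not covered by this sketch.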
That was then, this is now
• 1996-2000: focus on RDF and data
• 2000-2007: focus on OWL,
developing ontologies, sophisticated
reasoning
• 2008-…: Integrating and exploiting
large RDF data collections backed by
lightweight ontologies
A Linked Data story
•Wikipedia as a source of knowledge
–Wikis are a great way to collaborate
on building up knowledge resources
•Wikipedia as an ontology
–Every Wikipedia page is a concept or object
•Wikipedia as RDF data
–Map this ontology into RDF
•DBpedia as the lynchpin for Linked Data
–Exploit its breadth of coverage to integrate things
Populating Freebase KB
Underlying Powerset’s KB
Mined by TrueKnowledge
Wikipedia as an ontology
• Using Wikipedia as an ontology
–each article (~3M) is an ontology concept or instance
–terms linked via category system (~200k), infobox template
use, inter-article links, infobox links
–Article history contains metadata for trust, provenance, etc.
• It’s a consensus ontology with broad coverage
• Created and maintained by a diverse community for
free!
• Multilingual
• Very current
• Overall content quality is high
Wikipedia as an ontology
•Uncategorized and miscategorized articles
•Many ‘administrative’ categories (e.g., articles
needing revision) and useless ones (e.g., 1949 births)
•Multiple infobox templates for the same
class
•Multiple infobox attribute names for same
property
•No datatypes or domains for infobox
attribute values
• etc.
DBpedia: Wikipedia in RDF
•A community effort to extract
structured information from
Wikipedia and publish as RDF
on the Web
•Effort started in 2006 with EU funding
•Data and software open sourced
•DBpedia doesn’t extract information from
Wikipedia’s text, but from its structured
information, e.g., links, categories, infoboxes
DBpedia: Linked Data lynchpin
http://lookup.dbpedia.org/
DBpedia uses Wikipedia’s structured data
DBpedia extracts structured data from
Wikipedia, especially from Infoboxes
DBpedia ontology
• DBpedia 3.2 (Nov 2008) added a manually
constructed ontology with
–170 classes in a subsumption hierarchy
–880K instances
–940 properties with domain and range
• Instances per class:
Class       Instances
Place       248,000
Person      214,000
Work        193,000
Species      90,000
Org.         76,000
Building     23,000
• A partial, manual mapping was constructed from
infobox attributes to these terms
• Current domain and range constraints are “loose”
• Namespace: http://dbpedia.org/ontology/
• Properties per class:
Class          Properties
Person          56
Organisation    50
Place          110
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE { dbp:Barack_Obama ?Property ?Place .
        ?Place rdf:type dbpo:Place . }
http://dbpedia.org/sparql/
DBpedia: Linked Data lynchpin
Consider Baltimore, MD
Looking at the RDF description
We find assertions equating DBpedia's object for
Baltimore with those in other LOD datasets:
dbpedia:Baltimore%2C_Maryland
owl:sameAs census:us/md/counties/baltimore/baltimore;
owl:sameAs cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA;
owl:sameAs freebase:guid.9202a8c04000641f800000000004921a;
owl:sameAs geonames:4347778/ .
Since owl:sameAs is defined as an equivalence
relation, the mapping works both ways
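Treating owl:sameAs as an equivalence relation means the linked identifiers fall into equivalence classes, so any one identifier leads to all the others. A minimal sketch using a union-find structure follows; the identifiers are the ones shown above, while the function name `same_as_classes` is an assumption for illustration.

```python
def same_as_classes(pairs):
    """Group identifiers into equivalence classes given owl:sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

links = [
    ("dbpedia:Baltimore%2C_Maryland", "census:us/md/counties/baltimore/baltimore"),
    ("dbpedia:Baltimore%2C_Maryland", "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA"),
    ("dbpedia:Baltimore%2C_Maryland", "freebase:guid.9202a8c04000641f800000000004921a"),
    ("dbpedia:Baltimore%2C_Maryland", "geonames:4347778/"),
]
classes = same_as_classes(links)
```

With these four links, the GeoNames and Census identifiers end up in the same class even though no statement relates them directly; that is the symmetric and transitive part of the relation at work.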
Linked Data Cloud, March 2009
Four principles for linked data
• Use URIs to identify things that you expose to
the Web as resources
• Use HTTP URIs so that people can locate and
look up (dereference) these things.
• When someone looks up a URI, provide useful
information
• Include links to other, related URIs in the
exposed data as a means of improving
information discovery on the Web
-- Tim Berners-Lee, 2006
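Looking up an HTTP URI in practice usually means an HTTP GET with an Accept header asking for an RDF serialization. The sketch below only builds such a request with Python's standard library and does not send it. The resource URI reuses the dbp:Barack_Obama identifier from the earlier query, and whether a given server honours the requested media type is an assumption.

```python
import urllib.request

def rdf_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) an HTTP GET that asks for RDF via content negotiation."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

req = rdf_request("http://dbpedia.org/resource/Barack_Obama")
# To actually dereference the URI (requires network access):
#   data = urllib.request.urlopen(req).read()
```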
4.5 billion triples for free
•The full public LOD dataset has about 4.5
billion triples as of March 2009
•Linking assertions are spotty, but probably
include on the order of 10M equivalences
•Availability:
–Download the data in RDF
–Query it via public SPARQL servers
–Load it as an Amazon EC2 public dataset
–Launch it and required software as an Amazon
public AMI image
Wikitology
We’ve been exploring a different approach to
derive an ontology from Wikipedia through a
series of use cases:
– Identifying user context in a collaboration system from
documents viewed (2006)
– Improve IR accuracy by adding Wikitology tags to
documents (2007)
– ACE: cross document co-reference resolution for named
entities in text (2008)
– TAC KBP: Knowledge Base population from text (2009)
– Improve Web search engine by tagging documents and
queries (2009)
Wikitology 2.0 (2008)
[Figure: Wikitology 2.0 diagram. Labels: text, graphs, RDF, databases, Freebase KB, Yago, WordNet, human input & editing]
Wikitology tagging
•Using Serif’s output, we produced an entity
document for each entity.
Each document included the entity’s name, nominal and pronominal
mentions, APF type and subtype, and words in a
window around the mentions
•We tagged entity documents using Wikitology producing vectors of (1) terms and (2)
categories for the entity
•We used the vectors to compute features
measuring entity pair similarity/dissimilarity
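Two of the similarity measures used for such features, cosine similarity between tag vectors and the Dice score between sets of name strings, can be sketched as follows. The vectors and name sets here are hypothetical values chosen for illustration, not data from the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {tag: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def dice(a, b):
    """Dice coefficient of two sets, e.g. sets of whole name strings."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# Hypothetical tag vectors and name-mention sets for two entity documents.
entity1_tags = {"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.2}
entity2_tags = {"Webster_Hubbell": 0.9, "Arkansas_lawyers": 0.3}
entity1_names = {"Hubbell", "Webb Hubbell"}
entity2_names = {"Webb Hubbell", "Webster Hubbell"}

tag_similarity = cosine(entity1_tags, entity2_tags)
name_similarity = dice(entity1_names, entity2_names)
```

Storing vectors as dictionaries keeps the computation sparse: only tags present in both vectors contribute to the dot product.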
Wikitology Entity Document & Tags
Wikitology entity document
Fields, in order: name; type & subtype; mention heads (NAM and PRO); words surrounding mentions
<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
PRO: "he" "him" "his"
abc's accountant after again ago all alleges alone also and arranged
attorney avoid been before being betray but came can cat charges cheating
circle clearly close concluded conspiracy cooperate counsel counsel's
department did disgrace do dog dollars earned eightynine enough evasion
feel financial firm first four friend friends going got grand happening has he
help him his hope house hubbell hubbells hundred hush income increase
independent indict indicted indictment inner investigating jackie
jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie
little make many mickey mid money mr my nineteen nineties ninetyfour
not nothing now office other others paying peter_jennings president's
pressure pressured probe prosecutors questions reported reveal rock
saddened said schemed seen seven since starr statement such tax taxes tell
them they thousand time today ultimately vernon washington webb
webb_hubbell were what's whether which white whitewater why wife
years
</TEXT>
</DOC>
Wikitology article tag vector
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167
Top Ten Features (by F1)
Prec.   Recall  F1      Feature Description
90.8%   76.6%   83.1%   some NAM mention has an exact match
92.9%   71.6%   80.9%   Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
95.1%   65.0%   77.2%   the/a longest NAM mention is an exact match
86.9%   66.2%   75.1%   Similarity based on cosine similarity of Wikitology Article Medium article tag vector
86.1%   65.4%   74.3%   Similarity based on cosine similarity of Wikitology Article Long article tag vector
64.8%   82.9%   72.8%   Dice score of character bigrams from the 'longest' NAM string
95.9%   56.2%   70.9%   all NAM mentions have an exact match in the other pair
85.3%   52.5%   65.0%   Similarity based on a match of entities' top Wikitology article tag
85.3%   52.3%   64.8%   Similarity based on a match of entities' top Wikitology article tag
85.7%   32.9%   47.5%   Pair has a known alias
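F1 is the harmonic mean of precision and recall, so the third figure for each feature can be recomputed from the other two. A small check, using the figures for the first and last features listed above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

best = f1(0.908, 0.766)   # "some NAM mention has an exact match"
alias = f1(0.857, 0.329)  # "Pair has a known alias"
```

The known-alias feature shows why F1 is used for ranking: its precision is high, but low recall pulls the harmonic mean well below that of the other features.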
Knowledge Base Population
•The 2009 NIST Text Analysis Conference (TAC)
will include a new Knowledge Base Population
track
•Goal: discover information about named
entities (people, organizations, places) and
incorporate it into a KB
•TAC KBP has two related tasks:
–Entity linking: doc. entity mention -> KB entity
–Slot filling: given a document entity mention, find
missing slot values in large corpus
KBs and IE are Symbiotic
•KB info helps interpret text
•IE helps populate KBs
[Figure: cycle between Knowledge Base and Information Extraction from Text]
Wikitology 3.0 (2009)
[Figure: Wikitology 3.0 architecture. Labeled components: application-specific algorithms; Wikitology code; IR collection; articles; RDF reasoner; triple store; relational database; category links graph; infobox graph; page link graph; linked Semantic Web data & ontologies]
Wikipedia’s social network
•Wikipedia has an implicit ‘social network’ that
can help disambiguate PER mentions
•Resolving PER mentions in a short document to
KB people who are linked in the KB is good
•The same can be done for the network of ORG
and GPE entities
WSN Data
•We extracted 213K people from DBpedia’s
Infobox dataset, ~30K of which participate in
an infobox link to another person
•We extracted 875K people from Freebase,
616K of which were linked to Wikipedia pages, 431K
of which are in one of 4.8M person-person
article links
•Consider a document that mentions two
people: George Bush and Mr. Quayle
Which Bush & which Quayle?
Six George Bushes
Nine Male Quayles
A simple closeness metric
Let Si = {two-hop neighbors of entity i}
Cij = |intersection(Si, Sj)| / |union(Si, Sj)|
Cij>0 for six of the 56 possible pairs
0.43 George_H._W._Bush -- Dan_Quayle
0.24 George_W._Bush -- Dan_Quayle
0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
0.02 George_H._W._Bush -- Anthony_Quayle
0.01 George_H._W._Bush -- James_C._Quayle
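The closeness metric is a Jaccard overlap of two-hop neighborhoods and can be sketched on a small graph. Everything in the example data is hypothetical: the node names and edges are invented, and the sketch assumes that "two-hop neighbors" includes one-hop neighbors and excludes the entity itself.

```python
def two_hop_neighbors(graph, node):
    """Nodes reachable from `node` in one or two hops, excluding `node` itself."""
    one_hop = graph.get(node, set())
    reached = set(one_hop)
    for n in one_hop:
        reached |= graph.get(n, set())
    reached.discard(node)
    return reached

def closeness(graph, i, j):
    """Cij: size of the intersection of Si and Sj divided by the size of their union."""
    si, sj = two_hop_neighbors(graph, i), two_hop_neighbors(graph, j)
    union = si | sj
    return len(si & sj) / len(union) if union else 0.0

# Hypothetical undirected link graph; the edges are invented for illustration.
edges = [("bush_a", "quayle_a"), ("bush_a", "p1"), ("quayle_a", "p2"),
         ("bush_b", "p3"), ("quayle_b", "p4")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Score every candidate (Bush, Quayle) pair and keep the closest one.
candidates = [(b, q) for b in ("bush_a", "bush_b") for q in ("quayle_a", "quayle_b")]
best_pair = max(candidates, key=lambda pair: closeness(graph, *pair))
```

Scoring all candidate pairs and picking the highest Cij mirrors the ranked list above, where most pairs score zero and a single pair stands out.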
Application to TAC KBP
•Entity network data extracted from
DBpedia and Wikipedia provides evidence to
support KBP tasks:
–Mapping document mentions into infobox
entities
–Mapping potential slot fillers into infobox
entities
–Evaluating the coherence of entities as
potential slot fillers
Conclusion
•The Semantic Web is a powerful
approach to data interoperability and integration
•The research focus is shifting to a “Web of Data”
perspective
•Many research issues remain: uncertainty,
provenance, trust, parallel graph algorithms,
reasoning over billions of triples, user-friendly
tools, etc.
•Just as the Web enhances human intelligence, the
Semantic Web will enhance machine intelligence
•The ideas and technology are still evolving
http://ebiquity.umbc.edu/