Semantic Web & Semantic Web Processes (continued, Part II)
A course at Universidade da Madeira, Funchal, Portugal
June 16-18, 2005
Dr. Amit P. Sheth
Professor, Computer Sc., Univ. of Georgia
Director, LSDIS lab
CTO/Co-founder, Semagix, Inc
Special Thanks: Cartic Ramakrishnan, Karthik Gomadam
Part II
• Metadata, Enabling techniques and
technologies
• Automated metadata extraction and
annotation
• Computation and reasoning with focus on
relationships
• Example commercial Semantic Web platform
for building ontology-driven applications
What is Metadata?
• Data about data
– Statements, contexts
– Recursive – data about “data about data”
• Applications
– Content management
– Cataloguing
– Information retrieval, search
–…
"A Web content repository without metadata is like a library without an
index," - Jack Jia, IWOV
A continuum – from data to knowledge
Metadata extraction from heterogeneous content/data
WWW, Enterprise
Repositories
Nexis
UPI
AP
Feeds/
Documents
Digital Videos
...
...
Data Stores
Digital Maps
...
Digital Images
Create/extract as much (semantic)
metadata automatically as possible, from:
Any format (HTML, XML, RDB, text, docs)
Many media
Push, pull
Proprietary, Deep Web, Open Source
Digital Audios
EXTRACTORS
METADATA
Extracting a Text Document:
Syntactic approach
INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
LAYOUT
NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been
staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr
The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The
fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is
35% contained, while protection of the historic cabin continues.
Date => day month int ‘,’ int
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is
assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up.
All crews and overhead will mop-up where the fire
burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend,
depending on the results of infrared scanning.
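A minimal sketch of the syntactic rule Date => day month int ',' int, written as a Python regular expression. The pattern, the English day and month name lists, and the name extract_dates are assumptions made for illustration; the slides do not specify an implementation.

```python
import re

DAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Date => day month int ',' int
DATE_RE = re.compile(
    rf"\b(?P<day>{DAYS})\s+(?P<month>{MONTHS})\s+"
    rf"(?P<dom>\d{{1,2}}),\s*(?P<year>\d{{4}})\b"
)

def extract_dates(text):
    """Return one dict per date matched by the syntactic rule."""
    return [m.groupdict() for m in DATE_RE.finditer(text)]

if __name__ == "__main__":
    # Header line taken from the sample situation report.
    print(extract_dates("Friday August 1, 1997 - 0530 MDT"))
```

A rule like this needs no domain knowledge, which is what makes it syntactic: it recognizes the shape of a date but says nothing about what the date refers to.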
Organizing Information:
Automatic Classification
Traditional Text
Categorization
Customer
Training
Set
Statistical/AI
Techniques
Classify
Place in
a taxonomy
Routing/Distribution
Customer
Article Feed
4715
Classification of
Article 4715
Standard Metadata
Feed Source: iSyndicate
Posted Date: 11/20/2000
Taalee’s Categorization & Automatic Metadata Creation
Knowledge-base &
Statistical/AI Techniques
Taalee
Training
Set
Classify
Place in
a taxonomy
Catalog
Metadata
Automated Content
Enrichment (ACE)
Article 4715 Metadata
Standard metadata
– Feed Source: iSyndicate
– Posted Date: 11/20/2000
Semantic metadata
– Company Name: France Telecom, Equant
– Ticker Symbol: FTE, ENT
– Exchange: NYSE
– Topic: Company News
FTE
– Company Analysis
– Conference Calls
– Earnings
– Stock Analysis
ENT
– Company Analysis
– Conference Calls
– Earnings
– Stock Analysis
NYSE
– Member Companies
– Market News
– IPOs
Customer
Training
Set
Classification
of Article 4715
Article Feed
4715
Taalee Enterprise
Content Manager
Customization Suite
Precise
syndication/filtering
Routing/Distribution
Map to another taxonomy
Automatic Categorization & Metadata Tagging
Video Segment
with Associated Text
ABSOLUTE CONTROL OF THE SENATE IS
STILL IN QUESTION. AS OF TONIGHT, THE
REPUBLICANS HAVE 50 SENATE SEATS AND
THE DEMOCRATS 49. IN WASHINGTON STATE,
THE SENATE RACE REMAINS TOO CLOSE TO
CALL. IF THE DEMOCRATIC CHALLENGER
UNSEATS THE REPUBLICAN INCUMBENT THE
SENATE WILL BE EVENLY DIVIDED. IN
MISSOURI, REPUBLICAN SENATOR JOHN
ASHCROFT SAYS HE WILL NOT CHALLENGE
HIS LOSS TO GOVERNOR MEL CARNAHAN
WHO DIED IN A CRASH THREE WEEKS AGO.
GOVERNOR CARNAHAN'S WIFE IS EXPECTED
TO TAKE HIS PLACE. IN THE HIGHEST PROFILE
SENATE EVENT OF THE NIGHT, HILLARY
CLINTON WON THE NEW YORK SENATE SEAT.
SHE IS THE FIRST FIRST LADY TO RUN MUCH
LESS WIN.
Segment Description
Auto
Categorization
Semantic
Metadata
Automatic Categorization & Metadata
Tagging (Web page)
Video with
Editorialized
Text on the Web
Auto
Categorization
Semantic Metadata
Automatic Categorization & Metadata
Tagging (Feed)
Text
From
Bloomberg
Auto
Categorization
Semantic Metadata
Taalee Extraction and Knowledgebase Enhancement
Web Page
Enhanced Metadata Asset
Extraction
Agent
Taalee, Inc.
1999-2002
Sheth et al, 2002 Managing Semantic Content for the Web
Semantic Enhancement Server
Semantic Enhancement
Server: Semantic Enhancement
Server classifies content into the
appropriate topic/category (if not
already pre-classified), and
subsequently performs entity
extraction and content
enhancement with semantic
metadata from the Semagix
Freedom Ontology
How does it work?
• Uses a hybrid of statistical,
machine learning and
knowledge-base techniques for
classification
• Not only classifies, but also
enhances semantic metadata
with associated domain
knowledge
© Semagix, Inc.
Ambiguity Resolution during Metadata Extraction from content text
Document
----------------
Ontology
lookup
Entity
Candidate
SES
Find Entity Candidates in the document:
• Names and Synonyms
• Common variations (Jr, Sr, III, PLC, .com, etc.)
...
Note: Entity Candidates can be restricted to a relevant subset of ontology
Multiple matches found during entity lookup? (Yes/No)
If yes, resolve ambiguities for the entity using any/all of these criteria:
• Direct/Indirect relationships with other entities found
• Proximity analysis of related entities
• Entity refinement using subset analysis (‘Doe’ vs. ‘John Doe’)
After ambiguity is resolved:
• List relationships between identified entities in same document (optional in output)
• List relationship trails, e.g.
  CompExec → position → CompanyName
  Politician → party → country → watchList
Overcoming the key issue of resolving ambiguities in
facts & evidence
• Aggregation and normalization of any type of fact and evidence
into the domain ontology
– Resolution of issues over terminology
• e.g. “Benefit number” is an alias of “SSN”
– Resolution of issues over identity
• e.g. is executive “Larry Levy” an existing entity or a
new entity?
– Enabling decisions to be made on the trustworthiness of
existing facts
• Which source did the data originate from?
• How much supporting evidence was there?
– Validating and enforcing constraints, e.g. cardinality
• President of the United States (has cardinality) = Single
• Terrorist (has cardinality) = Multiple
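A small sketch of two of the checks listed above: terminology normalization through aliases, and cardinality enforcement. The tables ALIASES and MAX_HOLDERS and both function names are hypothetical; only the alias pair and the two cardinality examples come from the slide.

```python
# Terminology: alias -> canonical attribute name (example pair from the slide).
ALIASES = {"benefit number": "SSN"}

# Cardinality: maximum number of simultaneous holders; None means "Multiple".
MAX_HOLDERS = {
    "President of the United States": 1,
    "Terrorist": None,
}

def canonical(attribute):
    """Map an attribute name to its canonical form, if an alias is known."""
    return ALIASES.get(attribute.strip().lower(), attribute)

def can_assert(role, current_holders):
    """True if one more holder of `role` would not violate its cardinality.

    Roles missing from MAX_HOLDERS are treated as unconstrained (an assumption).
    """
    limit = MAX_HOLDERS.get(role)
    return limit is None or len(current_holders) < limit

if __name__ == "__main__":
    print(canonical("Benefit number"))
    print(can_assert("President of the United States", ["person A"]))
    print(can_assert("Terrorist", ["person B", "person C"]))
```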
Overcoming the key issue of resolving ambiguities
in facts & evidence (Contd…)
• Managing temporal aspects of the domain
– Expiration of entity instances
– E.g., “Hillary Clinton” is no longer the First Lady of the United
States but was until “May 3rd 2001”
• Providing auditing capabilities
– Stamping evidence with date, time and source
– E.g., Terrorist: “Seamus Monaghan”; date extracted: “2003-01-30”; time extracted: “16:45:27”; source: “FBI Watch list”
• Ontological relationships make for a more expressive model and
provide a better semantic description (compared to taxonomies)
– Information can be presented in natural language format
– E.g., “Bob Scott” is a founder member of business entity “AIX
LLP” that has traded in “Iran” that is on “FATF watch-list”
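One way to picture expiration and audit stamping is a fact record that carries its own validity window and provenance. This is a sketch under assumptions: the Fact class and its field names are hypothetical, and the source name and extraction timestamp in the usage example are placeholders, not values from the slides.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    source: str                         # audit: where the evidence came from
    extracted_at: datetime              # audit: when it was extracted
    valid_until: Optional[date] = None  # temporal: None means still valid

    def is_current(self, on: date) -> bool:
        """True if the fact has not expired by the given day."""
        return self.valid_until is None or on <= self.valid_until

if __name__ == "__main__":
    fact = Fact(
        subject="Hillary Clinton",
        relation="holds position",
        obj="First Lady of the United States",
        source="example feed",                   # placeholder source
        extracted_at=datetime(2000, 11, 20, 9),  # placeholder timestamp
        valid_until=date(2001, 5, 3),            # expiry date as given on the slide
    )
    print(fact.is_current(date(2005, 6, 16)))
```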
Example Scenario 1
Sample content text
Have you ever been to Athens?
How about Japan?
Ontology Matches:
- A: Athens[, Greece, Europe ]
- B: Athens[, Georgia, United States of America, North America ]
- C: Athens[, Ohio, United States of America, North America ]
- D: Athens[, Tennessee, United States of America, North America ]
- E: Japan[, Asia ]
Scores:
A, B, C, D and E all scored equally – hence no ambiguity resolution possible
Example Scenario 2
Sample content text
Have you ever been to Athens?
Or anywhere else in Georgia?
How about Japan?
Ontology Matches:
- A: Athens[, Greece, Europe ]
- B: Athens[, Georgia, United States of America, North America ]
- C: Athens[, Ohio, United States of America, North America ]
- D: Athens[, Tennessee, United States of America, North America ]
- E: Georgia[, Asia ]
- F: Georgia[, United States of America, North America ]
- G: Georgia On My Mind, Inc.
- H: Japan[, Asia ]
Scores:
B and F scored highest because of exact text match and relationship
Result:
Entity Ambiguity Resolved
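The two scenarios can be reproduced with a toy scorer. This is a sketch, not the Semagix algorithm: the weights (1.0 for an exact text match, 0.5 for a partial match, +1.0 for a direct relationship) and all names below are assumptions, and only the "direct relationship" criterion is modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    ancestors: tuple = ()  # containment chain, nearest parent first

ONTOLOGY = [
    Entity("Athens", ("Greece", "Europe")),
    Entity("Athens", ("Georgia", "United States of America", "North America")),
    Entity("Athens", ("Ohio", "United States of America", "North America")),
    Entity("Athens", ("Tennessee", "United States of America", "North America")),
    Entity("Georgia", ("Asia",)),
    Entity("Georgia", ("United States of America", "North America")),
    Entity("Georgia On My Mind, Inc."),
    Entity("Japan", ("Asia",)),
]

def is_parent(parent, child):
    """True if `parent` is the entity that directly contains `child`."""
    return (child.ancestors[:1] == (parent.name,)
            and child.ancestors[1:] == parent.ancestors)

def score_candidates(mentions):
    """Score every ontology entity matched by a surface mention in the text."""
    scores = {}
    for entity in ONTOLOGY:
        if entity.name in mentions:
            scores[entity] = 1.0   # exact text match
        elif any(m in entity.name for m in mentions):
            scores[entity] = 0.5   # partial text match (assumed weight)
    matched = list(scores)
    for a in matched:
        for b in matched:
            if is_parent(a, b):    # direct relationship between two candidates
                scores[a] += 1.0
                scores[b] += 1.0
    return scores

def resolve(mentions):
    """Return the top-scoring candidates, or [] if every candidate ties."""
    scores = score_candidates(mentions)
    best = max(scores.values())
    winners = [e for e, s in scores.items() if s == best]
    return [] if len(winners) == len(scores) else winners

if __name__ == "__main__":
    print(resolve({"Athens", "Japan"}))             # Scenario 1
    print(resolve({"Athens", "Georgia", "Japan"}))  # Scenario 2
```

With the mentions of Scenario 1 every candidate ties, so nothing is resolved; adding the mention "Georgia" lets Athens (Georgia, USA) and Georgia (USA) reinforce each other and rise to the top.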
Automatic Semantic Annotation of Text:
Entity and Relationship Extraction
KB, statistical
and linguistic
techniques
Semantic Enhancement Engine, 2002
Metadata Extraction and Semantic
Enhancement
[Hammond, Sheth, Kochut 2002]
Automatic Semantic Annotation
COMTEX Tagging
Value-added Semagix Semantic Tagging
Content
‘Enhancement’
Rich Semantic
Metatagging
Limited tagging
(mostly syntactic)
Value-added
relevant metatags
added by Semagix
to existing
COMTEX tags:
• Private companies
• Type of company
• Industry affiliation
• Sector
• Exchange
• Company Execs
• Competitors
Ontology Design – Fundamental
Principles
• There is no one correct way to model a domain— there are
always viable alternatives.
• The best solution almost always depends on the application
that you have in mind and the extensions that you
anticipate.
• Ontology development is necessarily an iterative process.
• Concepts in the ontology should be close to objects
(physical or logical) and relationships in your domain of
interest.
• These are most likely to be nouns (objects) or verbs
(relationships) in sentences that describe your domain.
Ontology Development 101: A Guide to Creating Your First Ontology
Natalya F. Noy and Deborah L. McGuinness
Semantic Associations: Beyond
simple relationships
Mechanisms for querying about and
retrieving complex relationships
between entities.
1. A is related to B by x.y.z
2. A is related to C by
i. x.y’.z’
ii. u.v (undirected path)
3. A is “related similarly” to B
as it is to C
(y’ ⊆ y and z’ ⊆ z ⇒ x.y.z ≈ x.y’.z’)
So are B and C related?
[Figure: graph over nodes A, B and C with edges x, y, z, y’, z’, u and v; the link between B and C is marked “?”]
Semantic Discovery project
Why do we need this?
• Exploit ability to create ontologies
and metadata for knowledge
discovery and gaining insight
• Very useful in information analytics
– national security
– business intelligence
– biology
ρ - Association
– Two entities e1 and en are semantically
connected if there exists a sequence e1,
P1, e2, P2, e3, … en-1, Pn-1, en in an RDF
graph where ei, 1 ≤ i ≤ n, are entities
and Pj, 1 ≤ j < n, are properties
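The definition of semantic connectivity amounts to path search over a set of triples. A minimal sketch using breadth-first search follows; the function name, the list-of-triples representation and the undirected flag are assumptions, and the sample triples borrow the property names purchased and paidby purely for illustration.

```python
from collections import deque

def find_association(triples, start, goal, undirected=False):
    """Return a sequence [e1, P1, e2, ..., en] linking start to goal, or None.

    `triples` is an iterable of (subject, property, object). With
    undirected=True, properties may also be traversed from object to subject.
    """
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
        if undirected:
            adjacency.setdefault(o, []).append((p, s))

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for prop, nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [prop, nxt])
    return None

if __name__ == "__main__":
    triples = [("&r1", "purchased", "&r2"), ("&r2", "paidby", "&r3")]
    print(find_association(triples, "&r1", "&r3"))
```

Breadth-first search returns one shortest connecting sequence; enumerating all associations between two entities, as a ranking step would need, requires a path-enumeration variant instead.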
Semantically Connected
“M’mmed”
&r1
“Abdulaziz”
&r6
“Atta”
“Alomari ”
&r5
ρ - Association
• Two entities are semantically similar if
both have ≥ 1 similar paths starting from
the initial entities, such that for each
segment of the path:
– Property Pi is either the same or subproperty
of the corresponding property in the other path
– Entity Ei belongs to the same class, classes
that are siblings, or a class that is a subclass
of the corresponding class in the other path
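The segment-by-segment test for semantic similarity can be sketched as follows. Paths are represented as lists of (property, class) pairs, and the hierarchies are plain child-to-parent dictionaries assumed to be acyclic; this representation, the function names, and the sample hierarchy entries (Payment, CreditCard, paidbyCash) are hypothetical.

```python
def ancestors(term, parent_of):
    """Return `term` followed by all of its ancestors (hierarchy assumed acyclic)."""
    chain = [term]
    while term in parent_of:
        term = parent_of[term]
        chain.append(term)
    return chain

def property_compatible(p, q, subproperty_of):
    """Same property, or one is a subproperty of the other."""
    return q in ancestors(p, subproperty_of) or p in ancestors(q, subproperty_of)

def class_compatible(c, d, subclass_of):
    """Same class, one a subclass of the other, or siblings under one parent."""
    if d in ancestors(c, subclass_of) or c in ancestors(d, subclass_of):
        return True
    return c in subclass_of and subclass_of.get(c) == subclass_of.get(d)

def similar_paths(path_a, path_b, subproperty_of, subclass_of):
    """True if the two paths match segment by segment under the rules above."""
    return len(path_a) == len(path_b) and all(
        property_compatible(pa, pb, subproperty_of)
        and class_compatible(ca, cb, subclass_of)
        for (pa, ca), (pb, cb) in zip(path_a, path_b)
    )

if __name__ == "__main__":
    subclass_of = {"Cash": "Payment", "CreditCard": "Payment"}
    subproperty_of = {"paidbyCash": "paidby"}
    a = [("purchased", "Ticket"), ("paidby", "Cash")]
    b = [("purchased", "Ticket"), ("paidbyCash", "CreditCard")]
    print(similar_paths(a, b, subproperty_of, subclass_of))
```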
ρ - Association
Passenger
Ticket
Cash
“M’mmed”
&r1
purchased
&r2
paidby
&r3
Semantic
Similarity
Semantic
Similarity
Semantic
Similarity
“Atta”
“Marwan”
&r7
“Al-Shehhi”
lname
purchased
&r8
paidby
&r9
The Need For Ranking
• Current test bed with > 6,000
entities and > 11,000 explicit
relations
• The semantic association query
(“Nasir Ali”, “AlQaeda”) returns 2,234
associations
• The results must be presented to a user in
a relevant fashion…thus the need for
ranking
SemRank: Ranking complex relationship search
Context: Why, What, How?
• Context => Relevance; Reduction in
computation space
• Context captures the user’s interest,
so that the relevant knowledge can be
picked out from the numerous
relationships between the entities
• By defining regions (or sub-graphs)
of the ontology we are capturing the
areas of interest of the user
Context Weight - Example
has Account
e3:Organization supports
e2:Financial
Organization
e6:Financial
Organization
works For
e4:Terrorist
Organization
e7:Terrorist
Organization
involved In
member Of
e5:Person
located In
e8:Terrorist
Attack
member Of
friend Of
at location
e1:Person
located In
Region1: Financial Domain, weight=0.50
Region2: Terrorist Domain, weight=0.75
e9:Location
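A simplified illustration of how region weights could feed a ranking. This is not the published ranking formula: here a path's context weight is just the average, over the classes on the path, of the weight of the region each class falls in (0 if none). The class-to-region assignment and the two example paths are assumptions; only the two region weights come from the slide.

```python
REGION_WEIGHTS = {"Financial Domain": 0.50, "Terrorist Domain": 0.75}

# Assumed assignment of ontology classes to the user-defined regions.
REGION_OF = {
    "Financial Organization": "Financial Domain",
    "Terrorist Organization": "Terrorist Domain",
    "Terrorist Attack": "Terrorist Domain",
}

def context_weight(path_classes):
    """Average region weight over the classes visited by a path."""
    if not path_classes:
        return 0.0
    total = sum(REGION_WEIGHTS.get(REGION_OF.get(c), 0.0) for c in path_classes)
    return total / len(path_classes)

def rank(paths):
    """Order paths from most to least relevant to the user's context."""
    return sorted(paths, key=context_weight, reverse=True)

if __name__ == "__main__":
    # Hypothetical paths between a Person and a Location.
    via_terror = ["Person", "Terrorist Organization", "Terrorist Attack", "Location"]
    via_finance = ["Person", "Financial Organization", "Location"]
    for path in rank([via_finance, via_terror]):
        print(round(context_weight(path), 3), path)
```

Because the Terrorist Domain carries the higher weight, paths that stay inside it rank above paths through the Financial Domain, which is the effect a user-defined context is meant to have.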
Semantic (Web) Technology –
State of the Art
Semantic Technology – Key Features
• Design ontology schema
• Automatically Populate ontology with domain knowledge (at
Enterprise Scale)
• Maintain Freshness of ontology (almost) automatically
• Processing of heterogeneous information (structured, semi-structured and unstructured)
• Automatic Semantic Metadata Extraction using lexical,
statistical or NLP techniques*
• Automatic Semantic Metadata Extraction using populated
ontology (Knowledgebase approach)
• Logic based reasoning (inferencing)
• Graph/relationship traversal based reasoning
Ontology-driven Information System Lifecycle
Schema
Creation
Analytic
Application
Creation
Ontology API
MB
Ontology
Population
KB
BSBQ
Application
Creation
Semantic Visualization
Metadata
Extraction
Semagix Freedom Architecture:
for building ontology-driven information systems
Sheth et al, 2002 Managing Semantic Content for the Web
Ontology Creation and Maintenance Steps
1. Ontology Model Creation (Description)
2. Knowledge Agent Creation
3. Automatic aggregation of Knowledge
4. Querying the Ontology
Ontology
Semantic Query Server