Presentation in PPT

SEMANTIC CONTENT MANAGEMENT
FOR ENTERPRISES AND NATIONAL
SECURITY
Keynote
CONTENT- AND SEMANTIC-BASED INFORMATION
RETRIEVAL @ SCI 2002
Amit Sheth
CTO, Voquette*, Inc.
Large Scale Distributed Information Systems (LSDIS) Lab
University Of Georgia; http://lsdis.cs.uga.edu
*Now Semagix, http://www.semagix.com
July 15, 2002
© Amit Sheth
New Enterprise
Content Management Challenges
1. More variety and complexity



More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc)
More types (Docs, Images -> Audio, Video, Variety of text: structured, unstructured)
More sources (internal, extranet, internet, feeds)
2. Scalability, Information Overload

Too much data, precious little information (Relevance)
3. Creating Value from Content



How to Distribute the right content to the right people as needed?
(Personalization -- book of business)
Customized delivery for different consumption options
(mobile/desktop, devices)
Insight, Decision Making (Actionable)
New Enterprise Content Management
Technical Challenges
1. Aggregation


Feed handlers/Agents that understand content representation and
media semantics
Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured
data of different types
2. Homogenization and Enhancement


Enterprise-wide common view
 Domain model, taxonomy/classification, metadata standards
Semantic Metadata: created automatically if possible
3. Semantic Applications

Search, personalization, directory, alerts, etc. using metadata and
semantics (semantic association and correlation), for improved
relevance, intelligent personalization, customization
Semantics:
The Next Step in the Web’s Evolution
The Semantic Web -- a vision with several views:
• “The Web of data (and connections) with meaning
in the sense that a computer program can learn
enough about what data means to process it.” [B99]
• “The semantic Web is an extension of the current
Web in which information is given well-defined
meaning, better enabling computers and people to
work in cooperation.” [BHL01]
• “The Semantic Web is a vision: the idea of having
data on the Web defined and linked in a way that it
can be used by machines not just for display
purposes, but for automation, integration and reuse
of data across various applications.” [W3C01]
Semantics for the Web
On the Semantic Web every resource (people, enterprises,
information services, application services, and devices) is
augmented with machine processable descriptions to
support the finding, reasoning about (e.g., which service is
best), and using (e.g., executing or manipulating) the
resource. The idea is that self-descriptions of data and other
techniques would allow context-understanding programs to
selectively find what users want, or for programs to work on
behalf of humans and organizations to make them more
efficient and productive.
Central Role of Metadata
Back End / Applications
Produce -> Aggregate -> Catalog/Index -> Integrate -> Syndicate -> Personalize -> Interactive Marketing
Where is the content? Whose is it?
What is this content about?
What other content is it related to?
What is the right content for this user?
What is the best way to monetize this interaction?
Semantic Metadata
Broadcast, Wireline, Wireless, Interactive TV
"A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV
“Metadata increases content value in each step of content value chain.” Amit Sheth
A Metadata Classification
User
More Semantics for Relevance to tackle Information Overload!!
Ontologies, Classifications, Domain Models
Domain Specific Metadata
(area, population (Census), land-cover, relief (GIS), metadata concept descriptions from ontologies)
Domain Independent (structural) Metadata
(C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Direct Content Based Metadata
(inverted lists, document vectors, LSI)
Content Dependent Metadata (size, max colors, rows, columns...)
Content Independent Metadata (creation-date, location, type-of-sensor...)
Data (Heterogeneous Types/Media)
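To make the lower layers of this classification concrete, here is a minimal Python sketch. It is an illustration only, not SCORE code: the function names and the two sample documents are hypothetical. It derives content-independent metadata, content-dependent metadata, and one kind of direct content-based metadata (inverted lists) for plain text.

```python
from collections import defaultdict
from datetime import date


def content_independent_metadata(created: date, location: str) -> dict:
    """Metadata known without reading the content itself."""
    return {"creation-date": created.isoformat(), "location": location}


def content_dependent_metadata(text: str) -> dict:
    """Metadata that depends on the content's form, not its meaning."""
    return {"size": len(text.encode("utf-8")), "rows": len(text.splitlines())}


def build_inverted_lists(docs: dict) -> dict:
    """Direct content-based metadata: map each token to the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            token = raw.strip(".,;:!?\"'()")
            if token:
                inverted[token].add(doc_id)
    return dict(inverted)


# Hypothetical sample documents.
docs = {
    "d1": "Cisco reports earnings.",
    "d2": "Nortel and Cisco compete.",
}
inverted = build_inverted_lists(docs)
print(sorted(inverted["cisco"]))  # ['d1', 'd2']
```

None of these layers says what a document is about; that is the gap the domain-specific and ontology-based layers at the top of the classification are meant to fill.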
Semantic Content Organization and Retrieval
Engine (SCORE) technology
• Automatically aggregates and extracts information
from disparate sources and multiple formats
• Automatically tags/annotates and categorizes
content
• Automatically creates relevant associations
- Maps content topics and their relationships
• Semantic query engine relates information and
knowledge both internal and external to the
organization into a single view
SCORE Architecture
• Fast main-memory based query engine with APIs and XML output
• Toolkit to design and maintain the Knowledgebase
• Distributed agents that automatically extract/mine knowledge from trusted sources
• Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content
• Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel
• CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata
• WorldModel specifies enterprise’s normalized view of information (ontology)
Voquette Enterprise Semantic
Platform Product Components
[Figure: platform components]
Content Sources: Databases, Email, Reports, Documents, XML/Feeds, Websites
Content Agents (CA), with CA Toolkit and Content Agent Monitor
Knowledge Sources (KS)
Knowledge Agents (KA), with KA Toolkit and Knowledge Agent Monitor
Enhancement Engine: Automatic Classification, Classification Committee, Entity Extraction, Enhanced Metadata, Domain Experts
World Model (WM Toolkit), Knowledgebase (KB Toolkit), Metabase
Semantic Engine: Knowledgebase and Metabase, Main Memory Index
XML APIs, Web Services
Enterprise Applications (EA): Search, Alerts, Portals, Personalize, Directory
JIVA
Knowledge Sources Used
Office of Foreign Assets Control (OFAC)
Capital Advantage (CA)
Federal Bureau of Investigation (FBI)
The Interdisciplinary Center (ICT)
Central Intelligence Agency (CIA)
Federation of American Scientists (FAS)
Data supplied from NASA (DPL)
Hoover’s (H)
ZDNet (ZD)
Market Guide (MG)
Entity Classes and Relationships populated by these knowledge sources:
PERSON (OFAC, FBI, DPL)
THING
-politician (OFAC, FBI, CIA, CA)
-event (ICT)
politician associated with politicalOrganization
terroristOrganization participated in terroristSponsoredEvent (ICT)
politician held politicalOffice
-politicalOffice (CIA, CA)
politician associated with politicalOffice
politicalOffice office(s) within govtOrganization
-terrorist (OFAC, FBI, DPL)
terrorist memberOf organization
terrorist appears on watchList
-companyExecutive (MG)
companyExecutive holdsOffice companyPosition
person has permanent address address (OFAC, FBI)
person has dob(date of birth) (OFAC, FBI)
person has pob(place of birth) (OFAC, FBI)
politicalOffice associated with organization
-watchList (OFAC, FBI, DPL)
terroristOrganization appears on watchList (OFAC, FBI, DPL)
-organization (OFAC, FBI, FAS, ICT, CA, CIA)
organization appears on watchList
organization memberOf suborganization
-company
company manufactures product (ZD)
company identifiedBy tickerSymbol (H)
PLACE
-organization located in place (H, OFAC)
-religiousAffiliation practiced in place (CIA)
-company headquarters in city (H)
companyposition position in company (MG)
company memberOf industry (H)
-tickerSymbol (H)
tickerSymbol memberOf exchange (H)
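The entity classes and relationships listed above can be read as subject/relationship/object triples. The Python sketch below is a minimal illustration under that assumption, not the actual Knowledgebase implementation. The `Knowledgebase` class, the `watchlists_for` helper, and the entity names (`person_x`, `org_a`, `org_b`, `list_1`) are hypothetical; only the relationship names `memberOf` and `appearsOn` come from the slide. It follows `memberOf` links through nested organizations to find watchlist appearances.

```python
from collections import defaultdict


class Knowledgebase:
    """Tiny in-memory triple store: (subject, relationship) -> set of objects."""

    def __init__(self):
        self._out = defaultdict(set)

    def add(self, subject: str, relationship: str, obj: str) -> None:
        self._out[(subject, relationship)].add(obj)

    def objects(self, subject: str, relationship: str) -> set:
        # Return a copy so callers cannot mutate the store.
        return set(self._out.get((subject, relationship), set()))


def watchlists_for(kb: Knowledgebase, person: str, max_depth: int = 2) -> set:
    """Watchlists reached directly or through up to max_depth levels of organizations."""
    found = kb.objects(person, "appearsOn")
    frontier = kb.objects(person, "memberOf")
    seen = set()
    for _ in range(max_depth):
        next_frontier = set()
        for org in frontier - seen:
            seen.add(org)
            found |= kb.objects(org, "appearsOn")
            next_frontier |= kb.objects(org, "memberOf")
        frontier = next_frontier
    return found


# Hypothetical instance data.
kb = Knowledgebase()
kb.add("person_x", "memberOf", "org_a")
kb.add("org_a", "memberOf", "org_b")
kb.add("org_b", "appearsOn", "list_1")

print(watchlists_for(kb, "person_x"))  # {'list_1'}
```

With `max_depth=1` the same call returns an empty set, because the watchlist is only reachable through the nested organization.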
SCORE Capabilities
• Semantics (understanding of content and user needs)
• Extreme relevance
• Semantic associations
• Near real-time
• Multiple applications/usage patterns (not just search)
• Automation
• Scalability in all aspects
Technologies Involved
• Ontology driven architecture (definitional,
assertional components)
• Automatic Classification with classifier committee
(multiple technologies, rather than one size fits all)
• Automatic Semantic Metadata
Extraction/Annotation
• Semantic associations/ knowledge inferences
• Scalability throughout with distributed architecture
and implementation (number of content and
knowledge sources, indexing, etc.)
• Main memory implementation, incremental checkpointing
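The "classifier committee" idea, several classification technologies voting rather than one size fits all, can be sketched as a simple majority vote. The slides do not say how the committee combines its members, so the voting rule, the keyword classifiers, and the category names below are all assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Optional

Classifier = Callable[[str], Optional[str]]


def keyword_classifier(keywords: dict) -> Classifier:
    """Build a classifier that picks the category with the most keyword hits."""
    def classify(text: str) -> Optional[str]:
        tokens = set(text.lower().split())
        scores = {cat: len(tokens & words) for cat, words in keywords.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None
    return classify


def committee_vote(classifiers: list, text: str) -> Optional[str]:
    """Return the category most members agree on; abstentions (None) are ignored."""
    votes = Counter()
    for classify in classifiers:
        label = classify(text)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Three hypothetical committee members using different evidence.
committee = [
    keyword_classifier({"Company News": {"earnings", "merger"}, "Sports": {"football", "playoff"}}),
    keyword_classifier({"Company News": {"cisco", "nasdaq"}, "Sports": {"nfl"}}),
    lambda text: "Sports" if "beat" in text else None,
]

print(committee_vote(committee, "cisco quarterly earnings beat estimates"))  # Company News
```

A weighted vote or a confidence threshold would be natural variations; which one the product used is not stated in the source.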
Performance
Queries per server per hour: > 1,980,000
Query Response Time (light load): 1 - 10 ms
Query Response Time (64 concurrent users): 65 ms
Incremental Index Update Frequency: 1 minute (near real-time)
Population/update rate in a Knowledgebase with 1 million entities/relationships: > 10,000 entities/relationships per hr.
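For readers who think in per-second rates, the hourly figures above convert as follows. This is plain arithmetic on the slide's numbers; since the slide states them as lower bounds (">"), the results are lower bounds too.

```python
QUERIES_PER_SERVER_PER_HOUR = 1_980_000   # from the slide (stated as "greater than")
KB_UPDATES_PER_HOUR = 10_000              # from the slide (stated as "greater than")
SECONDS_PER_HOUR = 3600

queries_per_second = QUERIES_PER_SERVER_PER_HOUR / SECONDS_PER_HOUR
updates_per_second = KB_UPDATES_PER_HOUR / SECONDS_PER_HOUR

print(queries_per_second)            # 550.0
print(round(updates_per_second, 2))  # 2.78
```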
Information Extraction for Metadata Creation
[Figure: content sources (WWW, Enterprise Repositories, Nexis, UPI, AP, Feeds/Documents, Data Stores, Digital Maps, Digital Images, Digital Videos, Digital Audios, ...) pass through EXTRACTORS to produce METADATA]
Key challenge: Create/extract as much (semantics) metadata automatically as possible
Automatic Categorization & Metadata
Tagging (Web page)
Video with
Editorialized
Text on the Web
Auto
Categorization
Semantic Metadata
Content Extraction and
Knowledgebase Enhancement
Web Page
Enhanced Metadata Asset
Extraction
Agent
Content Enhancement Workflow
Content Asset Index Evolution
1. XML Feed -> Extractor Agent for Bloomberg
Scans text for analysis; metadata extracted automatically; creates asset (index) out of extracted metadata
Asset
Syntax Metadata
Producer: BusinessWire
Source: Bloomberg
Date: Sept. 10 2001
Location: San Jose, CA
URL: http://bloomberg.com/1.htm
Media: Text
Semantic Metadata
Company: Cisco Systems, Inc.
2. Categorization & Auto-Cataloging System (CACS)
Scans text for analysis; classifies document into pre-defined category/topic; appends topic metadata to asset
Semantic Metadata (added)
Topic: Company News
3. Semantic Engine with Knowledge Base
Leverages knowledge to enhance metatagging; Enhanced Content Asset Indexed
Semantic Metadata (added)
Ticker: CSCO
Exchange: NASDAQ
Industry: Telecomm.
Sector: Computer Hardware
Executive: John Chambers
Competition: Nortel Networks
Headquarters: San Jose, CA
Knowledge Base
Company: Cisco Systems
Headquarters: San Jose
Sector: Computer Hardware
Industry: Telecomm.
Executives: John Chambers
Exchange: NASDAQ
Ticker: CSCO
Competition: Nortel Networks
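The three stages of this workflow (extractor agent, CACS classification, knowledge-based enhancement) can be sketched in a few lines of Python. This is a hedged illustration, not Voquette's implementation: the function names, the keyword rule standing in for CACS, and the sample article text are hypothetical. The metadata values are the ones shown on the slide.

```python
# Knowledge base facts as shown on the slide for Cisco Systems.
KNOWLEDGE_BASE = {
    "Cisco Systems, Inc.": {
        "Ticker": "CSCO",
        "Exchange": "NASDAQ",
        "Industry": "Telecomm.",
        "Sector": "Computer Hardware",
        "Executive": "John Chambers",
        "Competition": "Nortel Networks",
        "Headquarters": "San Jose, CA",
    }
}


def extractor_agent(feed_item: dict) -> dict:
    """Stage 1: create an asset from syntax metadata and spot known companies in the text."""
    asset = {"syntax": dict(feed_item["header"]), "semantic": {}}
    for company in KNOWLEDGE_BASE:
        short_name = company.split(",")[0]  # e.g. "Cisco Systems"
        if short_name in feed_item["text"]:
            asset["semantic"]["Company"] = company
    return asset


def cacs_classify(asset: dict, text: str) -> dict:
    """Stage 2: append a topic. A keyword rule stands in for the real classifier."""
    if any(word in text.lower() for word in ("earnings", "announces", "revenue")):
        asset["semantic"]["Topic"] = "Company News"
    return asset


def enhance(asset: dict) -> dict:
    """Stage 3: add knowledge-base facts about the recognized company."""
    facts = KNOWLEDGE_BASE.get(asset["semantic"].get("Company"), {})
    for name, value in facts.items():
        asset["semantic"].setdefault(name, value)
    return asset


feed_item = {
    "header": {"Producer": "BusinessWire", "Source": "Bloomberg", "Media": "Text"},
    "text": "Cisco Systems announces quarterly earnings.",  # hypothetical article text
}
asset = enhance(cacs_classify(extractor_agent(feed_item), feed_item["text"]))
print(asset["semantic"]["Ticker"])  # CSCO
```

The point of the last stage is that the ticker, executive, and competitor never appear in the article text; they are attached because the company was recognized.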
Intelligent Content Empowers the User
End-User
Intelligent Content =
Content which does contain the words the user asked for (Extractor Agents)
+ Content which does not contain the words the user asked for, but is about what he asked for (Value-added Metadata)
+ Content the user did not think to ask for, but which he needs to know (Semantic Associations)
Example 1 – Snapshots (“Jamal Anderson”)
1. Search for ‘Jamal Anderson’ in ‘Football’.
2. Click on first result for Jamal Anderson.
3. View metadata. Note that Team name and League name are also included in the metadata.
4. View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They are value-additions to the metadata to facilitate easier search.
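The effect demonstrated in this example, a page being findable by team or league even though its text mentions neither, can be sketched as a search that checks value-added metadata as well as text. The asset below is hypothetical: the slide does not show the page text or the metadata values, so the sentence, team, and league are supplied for illustration only.

```python
def tokens(value: str) -> set:
    """Lowercased whitespace tokens of a string."""
    return set(value.lower().split())


def search(query: str, assets: list) -> list:
    """Return (asset id, where it matched) for assets matching all query words."""
    wanted = tokens(query)
    hits = []
    for asset in assets:
        in_text = wanted <= tokens(asset["text"])
        meta_tokens = set()
        for value in asset["metadata"].values():
            meta_tokens |= tokens(value)
        in_metadata = wanted <= meta_tokens
        if in_text or in_metadata:
            hits.append((asset["id"], "text" if in_text else "metadata"))
    return hits


# Hypothetical asset: the text never names the team or league.
assets = [
    {
        "id": "a1",
        "text": "Jamal Anderson rushed for two touchdowns on Sunday",
        "metadata": {"Player": "Jamal Anderson", "Team": "Atlanta Falcons", "League": "NFL"},
    }
]

print(search("Falcons", assets))         # [('a1', 'metadata')]
print(search("Jamal Anderson", assets))  # [('a1', 'text')]
```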
Semantic Application Example
– Research Dashboard
Automatic
3rd party
content
integration
Focused
relevant
content
organized
by topic
(semantic
categorization)
Related relevant
content not
explicitly asked for
(semantic
associations)
Competitive
research
inferred
automatically
Automatic Content
Aggregation
from multiple
content providers
and feeds
Semantic Web – Intelligent Content
Intelligent Content = What You Asked for + What you need to know!
Related
Stock
News
COMPANY
Competition
COMPANIES in
INDUSTRY with
Competing PRODUCTS
COMPANIES in Same or
Related INDUSTRY
Regulations
Technology
Products
Important to INDUSTRY
or COMPANY
Industry
News
EPA
Impacting INDUSTRY
or Filed By COMPANY
SEC
Knowledge-based & Manual Associations
Syntax Metadata
Same
entity
led by
Semantic Metadata
Human-assisted
inference
Intelligence Analyst Browsing
Scenario
Innovations that affect User Experience
• BSBQ: Blended Semantic Browsing and Querying
– Ability to query and browse relevant desired content in a highly contextual manner
• Seamless access/processing of Content, Metadata and Knowledge
– Ability to retrieve relevant content, view related metadata, access relevant knowledge
and switch between all the above, allowing user to follow his train of thought
• dACE: dynamic Automatic Content Enhancement
– Ability to provide enhanced annotation features, allowing the user to retrieve relevant
knowledge about significant pieces of content during content consumption
• Semantic Engine APIs with XML output
– Ability to create customized APIs for the Semantic Engine involving Semantic
Associations with XML output to cater to any user application
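As a rough idea of what "Semantic Engine APIs with XML output" might return, the sketch below serializes search results, with their metadata and semantic associations, to XML using Python's standard library. The element and attribute names (`results`, `asset`, `metadata`, `association`) and the `competitor` relation are invented for this example; the real API's schema is not given in the source.

```python
import xml.etree.ElementTree as ET


def results_to_xml(query: str, assets: list) -> str:
    """Serialize assets, their metadata, and their associations as an XML string."""
    root = ET.Element("results", {"query": query})
    for asset in assets:
        node = ET.SubElement(root, "asset", {"url": asset["url"]})
        for name, value in asset["metadata"].items():
            ET.SubElement(node, "metadata", {"name": name}).text = value
        for relation, target in asset.get("associations", []):
            ET.SubElement(node, "association", {"relation": relation}).text = target
    return ET.tostring(root, encoding="unicode")


# Hypothetical result set using values from the Cisco example slide.
xml_text = results_to_xml(
    "Cisco",
    [
        {
            "url": "http://bloomberg.com/1.htm",
            "metadata": {"Company": "Cisco Systems, Inc.", "Topic": "Company News"},
            "associations": [("competitor", "Nortel Networks")],
        }
    ],
)
print(xml_text)
```

Any client application that can parse XML can then consume both the direct hits and the associated entities without knowing how the engine derived them.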
Boarding Gate
Interrogation
Security Portal
ARC AvSec Manager
Data Management
Data Mining
Check-in
IPG
Airport
Airspace
Visionics
AcSys
Voquette
Knowledgebase
Metabase
Threat Scoring
Airport LEO
Passenger Records
Reservation Data
Airline Data
Airport Data
Airline and Airport Data
Gov’t Watchlists
News Media
Web Info
LexisNexis
RiskWise
Future and Current
Risks
Sources Used
Knowledge Sources:
Content Sources :
FBI - Most Wanted Terrorists
Denied Persons Lists
Terrorism Files
ICT
Office of Foreign Asset Control
(OFAC)
Hamas terrorists
CNN Locations
FAA_Airport_Codes
About.com
Comtex_International
Hindustan Times
JerusalemPost
CNN
Newstrove_Hamas
Africa News Service
AFX News – Asia/UK/Europe
AP Worldstream
Asia Pulse
BusinessWire
ComputerWire (CTW)
EFE News Services
FWN Select
Itar-TASS
Knight Ridder News (Open)
Knight-Ridder Open
M2 - International
M2 Airline Industry Information
New World Publishing
PR Newswire
PRLine (PRL)
Resource News International
RosBusiness
United Press International
UPI Spotlights
Interrogation Kiosk –
Unique Advantages of Voquette
Voquette’s Semantic Technology enables flight authorities to:
- take a quick look at the passenger’s history
- check quickly if the passenger is on any official watchlist
- interpret and understand passenger’s links to other organizations (possibly terrorist)
- verify if the passenger has boarded the flight from a “high risk” region
- verify if the passenger originally belongs to a “high risk” region
- check if the passenger’s name has been mentioned in any news article along with the name of a known bad guy
(Example passenger shown: John Smith)
Threat Score Components
Flight Country Check
45
Person Country Check
25
0.15
Nested Organizations Check
75
0.8
Aggregate Link Analysis Score: 17.7
Screen values shown: John Smith; appearsOn watchList: FBI; 0.15
KNOWLEDGEBASE SEARCH
Action: Voquette’s rich knowledgebase is searched for this name, and associated information like position, aliases, relationships (past or present) of this name to other organizations, watchlists, country, etc. is retrieved
METABASE SEARCH
Action: Voquette’s rich metabase is searched for this name, and associated content stories mentioning the passenger’s name are retrieved
LEXIS NEXIS ANNOTATION
Action: Information about the passenger returned by Lexis Nexis is automatically enhanced by linking important entities to Voquette’s rich knowledgebase
LINK ANALYSIS
Action: Semantic analysis of the various related components (watchlist, Lexis Nexis, knowledgebase search, metabase search, etc.) to come up with an aggregate threat score for the passenger
WATCHLIST ANALYSIS
Action: Voquette’s rich knowledgebase is searched for the possible appearance of this name on any of the watchlists
Ability Proven:
- Ability to automatically aggregate relevant rich domain knowledge about a passenger and automatically co-relate it with other data in the knowledgebase to present a visual association picture of the passenger to the flight official
- Ability to automatically aggregate and retrieve relevant content stories, field reports, etc. about the passenger that can be used by flight officials to determine if the passenger has any connections with known bad people or organizations
- Ability to automatically aggregate relevant rich domain knowledge, recognize entities in a piece of text and automatically co-relate it with other data in the knowledgebase to present a clear picture of the passenger to the flight official
- Ability to aggregate relevant rich domain knowledge, recognize entities in a piece of text, co-relate it with other data in the knowledgebase, further search for relevant content and rank the threat factors to present an overall idea of the threat level of the passenger, allowing him to take quick action
- Ability to indicate to the flight official whether the passenger appears on an official watchlist
Query Comparison:
Voquette vs. RDBMS
What it will take RDBMS to support flight security application
Link Analysis Component / step | # Queries (Voquette) | # Queries (RDBMS) | Time (Voquette) | Time (RDBMS)
Direct Watchlist Match (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Organization Watchlist Match (person name, organization name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Nested Organization Watchlist Match (person name, organization name)
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organization's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Flight Origin (country name)
  retrieve country entity | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  see if country is on a list containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Person Origin (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's home country | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organization's relationships to lists containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Field Report Search (person name)
  perform SSE query for field reports that mention this person | 1 SSE Request | 2 SQL Queries | .03 sec | 5-30 sec
  retrieve a list of people associated with these field reports | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  determine which people are on watchlists, terrorists, etc… | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Total | 18 requests | 39-64 SQL Queries | .33 sec | 30-80 sec.
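The totals in this comparison (18 requests versus 39-64 SQL queries) can be checked by tallying the per-step counts. The short Python sketch below does that. The assignment of counts to steps follows the order in which they appear on the slide, which is an assumption about the original table layout.

```python
# Per-step RDBMS query counts as (min, max), one tuple per step in each component.
# The Voquette side issues exactly one request per step.
ENTITY_LOOKUP = (5, 10)   # "5-10 SQL Queries"
SINGLE_QUERY = (1, 1)     # "1 SQL Query"
FIELD_REPORT = (2, 2)     # "2 SQL Queries"

RDBMS_STEPS = {
    "Direct Watchlist Match": [ENTITY_LOOKUP, SINGLE_QUERY],
    "Organization Watchlist Match": [
        ENTITY_LOOKUP, SINGLE_QUERY, SINGLE_QUERY, ENTITY_LOOKUP, SINGLE_QUERY,
    ],
    "Nested Organization Watchlist Match": [ENTITY_LOOKUP, SINGLE_QUERY, SINGLE_QUERY],
    "Flight Origin": [SINGLE_QUERY, SINGLE_QUERY],
    "Person Origin": [ENTITY_LOOKUP, SINGLE_QUERY, SINGLE_QUERY],
    "Field Report Search": [FIELD_REPORT, SINGLE_QUERY, SINGLE_QUERY],
}

voquette_requests = sum(len(steps) for steps in RDBMS_STEPS.values())
rdbms_min = sum(low for steps in RDBMS_STEPS.values() for low, _ in steps)
rdbms_max = sum(high for steps in RDBMS_STEPS.values() for _, high in steps)

print(voquette_requests, rdbms_min, rdbms_max)  # 18 39 64
```

The gap comes almost entirely from entity lookups: each one costs a single CACS request on the Voquette side but 5-10 SQL queries on the RDBMS side.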
JIVA Functionality Interface
JIVA Semantic Console Start-up Interface
The mission of the JIVA project is to gather and analyze as much information of diverse kinds as possible about suspected individuals,
terrorist and other groups, organizations, events, etc. For this Terrorism domain, the JIVA Semantic Console provides an
information retrieval interface (shown below) that displays some fundamental semantic attributes (based on a
corresponding Terrorism domain model) to enable information retrieval in the right context.
Search interface with
more search features
(explained later)
Most fundamental
semantic attributes
specific to the
Terrorism domain
(fully customizable)
Analyst can enter
search values in the
appropriate attribute
fields (to search
in the right context)
Syntactic or
domain-independent
attributes for general
and media-specific
search
Once all other values
are set, click the
“Search” button to
search semantically
Analyst can choose
the type of media
of the desired content
JIVA
“Complete Picture” View – Knowledgebase Results
This section of the ‘Complete Picture’ shows factually known real-world information about the entity (person, organization,
event, etc.) of interest along with its contextual classification(s) and relationships with other entities in the Knowledgebase,
to provide a comprehensive overview of the entity.
Such knowledge is kept up-to-date by means of automated knowledge extractor agents that aggregate such knowledge
about millions of entities from various trusted knowledge sources.
Fraud investigation of
focal entity placing it in
one of five levels of
threats, based on score
Entity’s classifications
in taxonomy
Entity’s aliases and
other names
Entity’s real-world
relationships to various
other entities across
multiple entity classes
(as defined in the
Terrorism domain model)
Individual related
entities are clickable
to navigate to a new
knowledge page for
that entity e.g. Al Qaeda
- Knowledgebase
Navigation
Entity’s canonical name
While browsing through
relevant knowledge,
analyst can search
for content on the
focal entity or any of
the related entities.
The analyst can also
search for specific
relationships between
two or more entities
by checking
corresponding
entity boxes for search
- Blended Semantic
Browsing & Querying
(BSBQ)
JIVA
Facilitating Knowledge Discovery
On clicking any bin Laden-related entity (e.g. Al Qaeda), a page is
displayed to the analyst showing knowledge pertaining to that
entity, which can be used in a BSBQ mode, as described on the
previous screen.
Continuing this integrated approach of Semantic Browsing and
Querying, the analyst has the necessary ammunition to perform
Knowledge Discovery. The analyst can follow his train of thought
as he browses and queries to possibly discover unexpected
relationships and links between entities at various levels in an
indirect manner. Automatically uncovering such hidden related
entities facilitates addition of new and meaningful entities and
relationships to the analyst’s assessment tasks.
Wireless Application of Semantic Metadata
and Automatic Content Enrichment
[Figure: wireless device screens]
MyMedia: My Stocks, News, Sports, Music
MyStocks: CSCO, NT, IBM, Market
CSCO Analysis: Analyst Call, Conf Call, Earnings
Analyst Call listing: 11/08 ON24 Payne; 11/07 ON24 H&Q CC; 11/06 CBS Langlesis
Clicking on the link for Cisco Analyst Calls displays a listing
sorted by date. Semantic filtering uses just the right metadata to meet screen and
other constraints. E.g., Analyst Call focuses on the source and analyst name or
company. The icons denote additional metadata, such as “Strong Buy” by H&Q
Analyst.
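One way to picture "semantic filtering" under a screen constraint is a function that walks metadata fields in priority order and keeps only what fits on a line. The sketch below assumes exactly that. The field names, the priority order, and the character budget are hypothetical; the source only says that the most relevant metadata is chosen to meet screen and other constraints.

```python
def filter_metadata(metadata: dict, priority: list, max_chars: int) -> str:
    """Join the highest-priority metadata values that fit within max_chars."""
    parts = []
    used = 0
    for field in priority:
        value = metadata.get(field)
        if value is None:
            continue
        extra = len(value) + (1 if parts else 0)  # +1 for the separating space
        if used + extra > max_chars:
            break
        parts.append(value)
        used += extra
    return " ".join(parts)


# Hypothetical metadata for one analyst call entry.
call = {"date": "11/07", "source": "ON24", "analyst": "H&Q", "rating": "Strong Buy"}
priority = ["date", "source", "analyst", "rating"]

print(filter_metadata(call, priority, max_chars=14))  # 11/07 ON24 H&Q
print(filter_metadata(call, priority, max_chars=40))  # 11/07 ON24 H&Q Strong Buy
```

On the small screen the rating is dropped from the text line, which matches the slide's approach of carrying such extra metadata in an icon instead.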
Metadata’s role in emerging
iTV infrastructure
Video
Enhanced
Digital Cable
MPEG-2/4/7
MPEG
Encoder
Create Scene Description Tree
Channel sales
through Video Server Vendors,
Video App Servers, and Broadcasters

MPEG
Decoder
GREAT
USER
EXPERIENCE
Retrieve Scene Description Track
License metadata decoder and
semantic applications to
device makers
Node = AVO Object
Scene
Description
Tree
“NSF Playoff”
Node
Voquette/Taalee
Semantic
Engine
Produced by: Fox Sports
Creation Date: 12/05/2000
League: NFL
Teams: Seattle Seahawks, Atlanta Falcons
Players: Jon Kitna
Coaches: Mike Holmgren, Dan Reeves
Location: Atlanta
Object Content Information (OCI)
Enhanced
XML
Description
“NSF Playoff”
Metadata-rich
Value-added Node
Metadata for Automatic Content
Enrichment
Interactive Television
This screen is customizable
with interactivity feature
using metadata such as whether
there is a new Conference
Call video on CSCO.
Part of the screen can be
automatically customized to
show conference call specific
information– including transcript,
participation, etc. all of which are
relevant metadata
Conference Call itself can have
embedded metadata to
support personalization and
interactivity.
This segment has embedded or referenced metadata that is
used by a personalization application to show only the stocks
that the user is interested in.
Future
• Multimodal interfaces
• Multimodal semantics
• Multivalent Semantics
Metadata Usage: Keyword, Attribute
and Content Based Access