SCORE-examples

Content Management,
Metadata & Semantic Web
Keynote Address
Net.ObjectDAYS 2001, Erfurt, Germany, September 11, 2001
Amit Sheth
CTO/SrVP, Voquette (www.voquette.com)
[formerly Founder/CEO, Taalee, www.taalee.com]
Director, Large Scale Distributed Information Systems Lab,
University Of Georgia (lsdis.cs.uga.edu)
amit@sheth.org
Metadata Extraction is a patent-pending technology of Taalee, Inc.
Semantic Engine and WorldModel are trademarks of Taalee, Inc.
Confidential
HP
Enterprise Content Management
– sample user requirements (from a large Financial Svcs Company)
• “If a new bond comes into inventory, then we should get a
message, an alert...and be able to refine to say that I only
have California, Oregon and Washington clients....”
• “In the month of July, I received 95 e-mails from my
subscriptions. These e-mails included 61 that had 143
attachments that had 67 more attachments. In total therefore,
I received almost 400 documents including 5 different types
(HTML, PDF, Word, Rich Media, …). Even with this volume, I
had subscribed to only 10 categories in the Equities area.
There are a total of 26 Equity Subscription areas and a total of
166 categories to which a user can subscribe across all
Product Areas.”
Professional users of a traditional Content Management Product/Solution
Enterprise Content Management
– sample user requirements (from a large Financial Svcs Company)
• The real question is, "Which sales ideas may have significant
relevance to my book of business?" For example, an earnings
warning on an equity rated Hold or Lower and not owned by any
of my clients may not be of high relevance to me. Ideally, a
relevance analysis would:
  - Greatly reduce the volume of Product Area Ideas sent to every FA,
    hopefully to perhaps 10% to 20% or less of today's volume with
    ideas that are potentially actionable for that FA and his/her client
  - Result in FAs reading and evaluating the Product Area Ideas, taking
    appropriate actions, and generating sales because the Product
    Area Ideas would be relevant
  - Result in customer satisfaction because clients would understand
    FAs are paying attention to their needs and developing focused
    ideas
Professional users of a traditional Content Management Product/Solution
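The relevance rule quoted above (an idea on an equity rated Hold or lower that no client owns is probably not relevant) could be sketched as a simple filter. This is only an illustration: the `Idea` fields, the three-step rating scale, and the function names are assumptions made for the example, not part of any product described here.

```python
from dataclasses import dataclass

# Hypothetical rating scale, ordered from worst to best (an assumption).
RATINGS = ["Sell", "Hold", "Buy"]


@dataclass
class Idea:
    ticker: str
    event: str   # e.g. "earnings warning"
    rating: str  # one of RATINGS


def is_relevant(idea, client_holdings):
    """Keep an idea if a client owns the equity or it is rated above Hold."""
    owned = idea.ticker in client_holdings
    rated_above_hold = RATINGS.index(idea.rating) > RATINGS.index("Hold")
    return owned or rated_above_hold


def filter_ideas(ideas, client_holdings):
    """Return only the ideas that pass the relevance rule for one FA."""
    return [idea for idea in ideas if is_relevant(idea, client_holdings)]


# Made-up tickers: only BBB is held by one of this FA's clients.
ideas = [
    Idea("AAA", "earnings warning", "Hold"),
    Idea("BBB", "earnings warning", "Hold"),
    Idea("CCC", "upgrade", "Buy"),
]
kept = filter_ideas(ideas, {"BBB"})
```

A real relevance engine would weigh many more signals (client state, product area, subscriptions); the point here is only that the book of business drives the filter.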
Enterprise Content Management
– sample product requirements (from a large Financial Svcs Company)
• “Content generation is a more complex and probably costly
problem to solve ... we reportedly create about 9 million
messages a month for field delivery. On average, this would
mean 1,000 messages per month per ‘big user’ or perhaps
only 500 to 600 per ‘little user’.… I strongly believe an analysis
is in order of the nature and necessity of generated content,
the establishment of content generation standards, the
movement towards development and implementation of a
relevance engine, …”
Director (Product Management) of a large company that uses a leading Content Management Product
New Enterprise Content Management
Challenges
1. More variety and complexity
   • More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc.)
   • More types (Docs, Images -> Audio, Video, variety of text: structured, unstructured)
   • More sources (internal, extranet, internet, feeds)
2. Information Overload
   • Too much data, precious little information (Relevance)
3. Creating Value from Content
   • How to distribute the right content to the right people as needed? (Personalization -- book of business)
   • Customized delivery for different consumption options (mobile/desktop, devices)
   • Insight, Decision Making (Actionable)
New Enterprise Content Management
Technical Challenges
1. Aggregation
   • Feed handlers/Agents that understand content representation and media semantics
   • Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types
2. Homogenization and Enhancement
   • Enterprise-wide common view
     - Domain model, taxonomy/classification, metadata standards
   • Semantic Metadata, created automatically if possible
3. Semantic Applications
   • Search, personalization, directory, alerts, etc. using metadata and semantics (semantic association and correlation), for improved relevance, intelligent personalization, customization
Semantics
• “meaning or relationship of meanings, or relating to meaning” (Webster)
• is concerned with the relationship between the linguistic
symbols and their meaning or real-world objects
• meaning and use of data (Information System)
Example: Palm -> Company, Product, Technology, Tree Name, part
of location (Palm Springs, Palm Beach)
Semantics, Ontologies (Domain Models), Metamodels,
Metadata, Content/Data
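To make the "Palm" example concrete, here is a minimal sketch of context-based disambiguation: the sense whose cue words overlap most with the surrounding text wins. The sense inventory and cue words are invented for illustration, and this is not a description of how any particular engine resolves ambiguity.

```python
# Hypothetical sense inventory for the ambiguous term "Palm" (cue words are made up).
SENSES = {
    "Company": {"shares", "earnings", "ceo", "stock"},
    "Product": {"handheld", "pda", "stylus", "device"},
    "Tree": {"leaves", "coconut", "tropical", "trunk"},
    "Location": {"springs", "beach", "florida", "california"},
}


def disambiguate(text):
    """Pick the sense whose cue words overlap most with the text."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    scores = {sense: len(cues & words) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    # With no cue word present, refuse to guess.
    return best if scores[best] > 0 else "Unknown"


company_sense = disambiguate("Palm shares fell after weak earnings")
location_sense = disambiguate("A hotel in Palm Beach, Florida")
```

A domain model (ontology) plays the role of the hand-written `SENSES` table here: it supplies the candidate meanings and the context that separates them.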
Semantics:
The Next Step in the Web’s Evolution
“The Web of data (and connections) with meaning in the
sense that a computer program can learn enough about
what the data means to process it. . . . Imagine what
computers can understand when there is a vast tangle of interconnected
terms and data that can automatically be followed.” (Tim Berners-Lee,
Weaving the Web, 1999)
A Content Management centric definition of
Semantic Web: The concept that Web-accessible
content can be organized and utilized semantically,
rather than through syntactic and structural methods.
Organizing Content
Different and Related Objectives: Search, Browse, Summarization,
Association/Relationships
• Indexing
• Clustering
• Classification
• Controlled Vocabulary, Reference Data/Dictionary/Thesaurus
• Metadata
• Knowledge Base (Entities/Objects and Relationships)
Traditional Text Categorization
[Figure: Customer article feed (e.g. Article 4715) and Customer Training Set -> Statistical/AI Techniques -> Classify (place in a taxonomy) -> Routing/Distribution]
Most traditional Content Management Products support
categorization of unstructured content.
Classification of Article 4715
Standard Metadata
Feed Source: iSyndicate
Posted Date: 11/20/2000
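The training-set-driven categorization outlined on this slide can be illustrated with a small multinomial naive Bayes classifier. This is a generic textbook technique chosen for the sketch, not a claim about what any specific Content Management product uses; the training sentences and category names are made up.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, category). Returns word counts, doc counts, vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the category with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up training set standing in for the customer training set.
model = train([
    ("quarterly earnings beat estimates", "Earnings"),
    ("earnings warning issued", "Earnings"),
    ("company files for ipo", "IPOs"),
    ("ipo priced above range", "IPOs"),
])
label = classify(model, "earnings estimates raised")
```

Note that the output is only a taxonomy node; nothing about the companies or tickers in the article is captured, which is the gap the next slide addresses.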
Voquette/Taalee’s Categorization & Automatic Metadata Creation
[Figure: Article feed (e.g. Article 4715) -> Knowledge-base & Statistical/AI Techniques, using the Taalee Training Set & KB and the Customer Training Set & KB -> Classify (place in a taxonomy) -> Catalog Metadata -> Automated Content Enrichment (ACE) -> Semantic Engine™ -> Precise Personalization/Syndication/Filtering, Routing/Distribution, Map to another taxonomy]
Article 4715 Metadata
Standard metadata
  Feed Source: iSyndicate
  Posted Date: 11/20/2000
Semantic metadata
  Company Name: France Telecom, Equant
  Ticker Symbol: FTE, ENT
  Exchange: NYSE
  Topic: Company News
Classification of Article 4715
  FTE: Company Analysis, Conference Calls, Earnings, Stock Analysis
  ENT: Company Analysis, Conference Calls, Earnings, Stock Analysis
  NYSE: Member Companies, Market News, IPOs
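The difference between standard and semantic metadata for Article 4715 can be sketched as a knowledge-base lookup: company names found in the text pull in ticker and exchange values that the text itself may not state. The dictionary layout, the `enrich` function, and the sample sentence are assumptions for illustration; only the field values come from the slide.

```python
# Hypothetical knowledge base keyed by company name; values mirror the slide's example.
KNOWLEDGE_BASE = {
    "France Telecom": {"ticker": "FTE", "exchange": "NYSE"},
    "Equant": {"ticker": "ENT", "exchange": "NYSE"},
}


def enrich(article_text, standard_metadata):
    """Return standard metadata plus semantic metadata derived from the KB."""
    companies = [name for name in KNOWLEDGE_BASE if name in article_text]
    semantic = {
        "Company Name": companies,
        "Ticker Symbol": [KNOWLEDGE_BASE[c]["ticker"] for c in companies],
        "Exchange": sorted({KNOWLEDGE_BASE[c]["exchange"] for c in companies}),
    }
    return {**standard_metadata, **semantic}


# The article text below is invented; the standard metadata comes from the slide.
metadata = enrich(
    "France Telecom discussed its plans for Equant on a conference call.",
    {"Feed Source": "iSyndicate", "Posted Date": "11/20/2000"},
)
```

Exact string matching is the weakest part of this sketch; a production extractor would also need to handle aliases and ambiguous names.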
Technologies for Organizing Content
• Information Retrieval/Document Indexing
  - TF-IDF/statistical, Clustering, LSI
• Statistical learning/AI: Machine learning, Bayesian, Markov Chains, Neural Network
• Lexical, Natural language
• Thesaurus, Reference data, Domain models (Ontology)
• Information Extractors
• Reasoning/Inferencing: Logic based, Knowledge-based, Rule processing and
Most powerful solutions require combining several of these,
addressing more of the objectives
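As a small illustration of the first technique in the list, TF-IDF weights a term by how often it occurs in a document and how rare it is across the collection. The sketch below uses the plain `tf * log(N / df)` form; other variants (smoothing, normalization) exist, and the token lists are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {term: (count / len(doc)) * math.log(n / df[term]) for term, count in tf.items()}
        )
    return weights


docs = [
    ["cisco", "router", "news"],
    ["lucent", "switch", "news"],
    ["cisco", "lucent", "news"],
]
weights = tf_idf(docs)
```

Here "news" appears in every document and gets weight zero, while "router" outranks "cisco" in the first document because it is rarer. Purely statistical weights like these say nothing about what a term means, which is why the slide argues for combining them with domain models and extraction.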
Ontology
• Standardizes meaning, description, representation
of involved concepts/terms/attributes
• Captures the semantics involved via domain
characteristics, resulting in semantic metadata
• “Ontological Commitment” forms basis for
knowledge sharing and reuse
Ontology provides semantic underpinning.
An Ontology
Terms/Concepts (Attributes): site, latitude, longitude, eventDate, description, damage, damagePhoto, numberOfDeaths, magnitude, bodyWaveMagnitude, conductedBy, explosiveYield
Functional Dependencies (FDs): site => latitude, longitude
Hierarchies:
  Disaster
    Natural Disaster: Volcano, Earthquake
    Man-made Disaster: NuclearTest
Domain Rules: magnitude > 0, magnitude < 10, bodyWaveMagnitude > 0, bodyWaveMagnitude < 10
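The functional dependency and domain rules in this ontology can be checked mechanically. The following sketch assumes records are plain dictionaries keyed by the attribute names on the slide; the function names and sample values are hypothetical.

```python
# Domain rules from the slide: both magnitudes must lie strictly between 0 and 10.
DOMAIN_RULES = {
    "magnitude": lambda v: 0 < v < 10,
    "bodyWaveMagnitude": lambda v: 0 < v < 10,
}


def check_rules(record):
    """Return the attributes of a record that break a domain rule."""
    return [attr for attr, ok in DOMAIN_RULES.items()
            if attr in record and not ok(record[attr])]


def check_fd(records):
    """Check site => latitude, longitude; return sites with conflicting coordinates."""
    seen, violations = {}, set()
    for r in records:
        coords = (r["latitude"], r["longitude"])
        # setdefault stores the first coordinates seen for a site and returns them afterwards.
        if seen.setdefault(r["site"], coords) != coords:
            violations.add(r["site"])
    return violations


# Made-up records for illustration only.
bad_rules = check_rules({"magnitude": 12, "bodyWaveMagnitude": 5})
bad_sites = check_fd([
    {"site": "X", "latitude": 10.0, "longitude": 20.0},
    {"site": "X", "latitude": 10.0, "longitude": 25.0},
    {"site": "Y", "latitude": 1.0, "longitude": 2.0},
])
```

Encoding such constraints in the ontology, rather than in each application, is what lets extracted metadata be validated consistently across sources.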
Controlled Vocabularies/
Classifications/Taxonomies/Ontologies
• WordNet
• Cyc
• The Medical Subject Headings (MeSH): NLM's controlled
vocabulary used for indexing articles, for cataloging books and other
holdings, and for searching MeSH-indexed databases, including
MEDLINE. MeSH terminology provides a consistent way to retrieve
information that may use different terminology for the same
concepts. Year 2000 MeSH includes more than 19,000 main
headings, 110,000 Supplementary Concept Records (formerly
Supplementary Chemical Records), and an entry vocabulary of over
300,000 terms.
Open Directory Project (ODP):
Classification/Taxonomy & Directory
Example 1 – Snapshots (“Jamal Anderson”)
• Search for ‘Jamal Anderson’ in ‘Football’.
• Click on the first result for Jamal Anderson.
• View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They were Taalee’s value additions to the metadata to facilitate easier search.
• View metadata. Note that Team name and League name are also included in the metadata.
Example 2 – Snapshots (“Gary Sheffield”)
• Search for ‘Gary Sheffield’ in ‘Baseball’.
• Click on the first result for Gary Sheffield.
• View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They were Taalee’s value additions to the metadata to facilitate easier search.
• View metadata. Note that Team name and League name are also included in the metadata.
Semantic Web – Intelligent Content
(supported by Taalee Semantic Engine)
Intelligent Content = What You Asked for + What you need to know!
[Figure: content related to a COMPANY]
• Related Stock News
• Competition: COMPANIES in INDUSTRY with Competing PRODUCTS
• Industry News: COMPANIES in Same or Related INDUSTRY
• Technology, Products: Important to INDUSTRY or COMPANY
• Regulations (EPA, SEC): Impacting INDUSTRY or Filed By COMPANY
Semantic Application – Equity Dashboard
• Automatic 3rd party content integration
• Focused relevant content organized by topic (semantic categorization)
• Related news not specifically asked for (Semantic Associations)
• Competitive research inferred automatically
• Automatic Content Aggregation from multiple content providers and feeds
Metadata centric Content Management Architecture
[Figure: ASP/Enterprise hosted. Extractor Agents 1, 2 and 3 read Internal Source 1 (Research), Internal Source 2, and External feeds/Web (e.g. Reuters). Components: Voquette Metabase, Semantic Engine, World Model, Knowledge Base, Semantic Application. Output: XCM-compliant metadata, XML or other format, to Third-party Content Mgmt and Syndication.]
1. Cisco story from Source 1 passed on to add semantic associations (Story on Cisco)
2. World Model consults Knowledge Base for Cisco’s competition
3. Returns result: Lucent is a competitor of Cisco
4. Lucent story from external feeds picked for publishing as “semantically related” to Cisco story – passed on to Dashboard (Story on Lucent)
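The Cisco/Lucent flow in this architecture amounts to a relationship lookup followed by a feed filter. The sketch below assumes stories already carry a list of extracted company names; the `COMPETITORS` table and the story dictionaries are hypothetical stand-ins for the knowledge base and the metabase.

```python
# Hypothetical competitor relationships; only the Cisco/Lucent pair comes from the slide.
COMPETITORS = {"Cisco": {"Lucent"}, "Lucent": {"Cisco"}}


def related_stories(story, feed):
    """Return feed stories about competitors of the companies named in `story`."""
    rivals = set()
    for company in story["companies"]:
        rivals |= COMPETITORS.get(company, set())
    rivals -= set(story["companies"])  # never treat a company as its own rival
    return [s for s in feed if rivals & set(s["companies"])]


cisco_story = {"title": "Story on Cisco", "companies": ["Cisco"]}
external_feed = [
    {"title": "Story on Lucent", "companies": ["Lucent"]},
    {"title": "Story on Palm", "companies": ["Palm"]},
]
associated = related_stories(cisco_story, external_feed)
```

The Lucent story is selected even though the user never asked for Lucent; the association comes entirely from the knowledge base.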
Semantic Technology Features
• Unstructured Text Content
• Semi-Structured Content
• Structured Content
• Audio/Video Content with associated text (transcript, journalist notes)
• Create a Customized "World Model" (Taxonomy Tree with customized domain attributes)
• Automatically homogenize content feed tags
• Automatically categorize unstructured text
• Automatically create tags based on text itself
• Create and maintain a Customized Knowledge Base for any domain
• Automatically enhance content tags based on information beyond text
• Build contextually relevant custom research applications
• Contextual Search (an order of magnitude better than keyword-based search)
• Support push or pull delivery/ingestion of content
• Personalization/Alerts/Notifications
• Real Time Indexing (stories indexed for search/personalization within a minute)
• Provide the user with relevant information not explicitly asked for (Semantic Associations)
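One of the features above, homogenizing content feed tags, can be pictured as mapping each provider's tag names onto a common schema. The feed names and source tag names below are invented for the example; the target field names follow the metadata shown earlier in the deck.

```python
# Hypothetical per-feed mappings from provider tag names to a common schema.
TAG_MAP = {
    "feed_a": {"src": "Feed Source", "date": "Posted Date", "co": "Company Name"},
    "feed_b": {"provider": "Feed Source", "published": "Posted Date", "company": "Company Name"},
}


def homogenize(feed_name, item):
    """Rename the tags of one feed item to the common schema; drop unmapped tags."""
    mapping = TAG_MAP[feed_name]
    return {mapping[tag]: value for tag, value in item.items() if tag in mapping}


a = homogenize("feed_a", {"src": "iSyndicate", "date": "11/20/2000", "layout": "wide"})
b = homogenize("feed_b", {"provider": "Reuters", "company": "Cisco"})
```

Once every feed uses the same field names, the search, personalization, and alerting features in the list can operate on one metadata model.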
Along with the evolution of
metadata and semantic
technologies enabling the next
generation of the Web, Content
Management has entered the
next generation of Enhanced
Content Management.
Resources/References
RDF: www.w3.org/TR/REC-rdf-syntax/
ICE: www.icestandard.org
Meta Object Facility (MOF) Specification, Version 1.3, September 27, 1999:
http://cgi.omg.org/cgi-bin/doc?ad/99-09-05
XML Metadata Interchange (XMI) Specification, Version 1.1, October 25, 1999:
http://cgi.omg.org/cgi-bin/doc?ad/9910-02
http://cgi.omg.org/cgi-bin/doc?ad/99-10-03
DAML: www.daml.org
NEWSML: newsshowcase.reuters.com
PRISM: www.prismstandard.org/techdev/prismspec1.asp
RIXML: www.rixml.org
XCM: www.vignette.com
OIL: www.ontoknowledge.org/oil
SEMANTICWEB: www.semanticweb.org, business.semanticweb.org
VOICEXML: www.voicexml.org
MPEG7: www.darmstadt.gmd.de/mobile/MPEG7/
Taalee: www.taalee.com
Applied Semantics: www.appliedsemantics.com
Ontoprise: www.ontoprise.com
Multimedia Data Management: Using Metadata
to Integrate and Apply Digital Media, Amit Sheth
& Wolfgang Klas, Eds., McGraw Hill, ISBN: 0-07-057735-8, 1998.
Information Brokering, Vipul Kashyap & Amit
Sheth, Kluwer Academic Publishers, 2001.
Voquette Semantic Technology White Paper.
Mysteries of Metadata, Speaker – Amit Sheth,
Workshop at Content World 2001.
Infoquilt Project, LSDIS lab.
http://www.taalee.com
http://lsdis.cs.uga.edu/~amit