slides - VLDB 2005

The SphereSearch Engine for Unified
Ranked Retrieval of Heterogeneous
XML and Web Documents
Ralf Schenkel
joint work with Jens Graupmann and Gerhard Weikum
Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work
VLDB 2005, Trondheim, Norway
Example query #1
Which professors from Saarbrücken do research on XML?
Different terminology in query and Web pages
Director of Department 5
DBS & IS
Professor at
Saarland University
Abstraction Awareness
Example query #2
Conferences about XML in Norway 2005
Information is not present on a single page,
but distributed across linked pages
VLDB Conference 2005,
Trondheim, Norway
Call for Papers
…XML…
Context Awareness
Example query #3
What are the publications of Max Planck?
Max Planck should be instance of
concept person, not of concept institute
Concept Awareness
SphereSearch Concepts
Goal: Increase recall & precision for hard
queries on linked and heterogeneous data
• Unified search for unstructured, semistructured,
structured data from heterogeneous sources
• Graph-based model, including links
• Annotation engines from NLP to recognize classes
of named entities (persons, locations, dates, …) for
concept-aware queries
• Flexible yet simple abstraction-aware query
language with context-aware scoring
• Compactness-based scores
Some Related Work
• Web Query Languages
e.g., W3QS [VLDB95], WebOQL [ICDE98],…
• Web IR with thesauri
e.g., Qiu et al.[SIGIR93], Liu et al.[SIGIR04],…
• XML IR
e.g., XXL [WebDB00], XIRQL [SIGIR01],
XSearch [VLDB03], XRank [SIGMOD03], …
• Information extraction
e.g., Lixto, KnowItAll, …
• Advanced Web graph IR
e.g., BANKS [ICDE02], Hristidis et al.[VLDB03], …
Unifying Search on Heterogeneous Data
Sources: Web, Intranet, Databases, Enterprise Information Systems, …
→ Heuristics, type-specific transformations
→ XML
Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
• Headlines
<h1>Experiments</h1>
<h2>Settings</h2>
We evaluated...
<h2>Results</h2>
Our system...
is transformed to
<Experiments>
<Settings>...</Settings>
<Results>...</Results>
</Experiments>
• Patterns
<b>Topic:</b>XML
is transformed to
<Topic>XML</Topic>
• Rules for tables, lists, …
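The headline and pattern heuristics can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (regular expressions instead of a real HTML parser, well-formed input); the function names `headings_to_sections` and `label_patterns` are hypothetical and not part of SphereSearch.

```python
import re

# Assumption: headings are plain <h1>..<h6> tags without attributes.
HEADING = re.compile(r"<h([1-6])>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
# Assumption: a "label pattern" is a bold word followed by a colon.
BOLD_LABEL = re.compile(r"<b>\s*(\w+)\s*:\s*</b>\s*([^<\n]+)")

def headings_to_sections(html: str) -> str:
    """Nest the text under elements named after the enclosing headline."""
    out = []    # pieces of the output document
    stack = []  # currently open sections as (heading level, tag name)
    pos = 0
    for m in HEADING.finditer(html):
        out.append(html[pos:m.start()].strip())
        level = int(m.group(1))
        tag = re.sub(r"\W+", "_", m.group(2).strip())
        # A new headline closes all open sections of the same or a deeper level.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        out.append(f"<{tag}>")
        stack.append((level, tag))
        pos = m.end()
    out.append(html[pos:].strip())
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(piece for piece in out if piece)

def label_patterns(html: str) -> str:
    """Rewrite '<b>Label:</b>value' into '<Label>value</Label>'."""
    return BOLD_LABEL.sub(
        lambda m: f"<{m.group(1)}>{m.group(2).strip()}</{m.group(1)}>", html
    )

page = ("<h1>Experiments</h1><h2>Settings</h2>We evaluated..."
        "<h2>Results</h2>Our system...")
print(headings_to_sections(page))
print(label_patterns("<b>Topic:</b>XML"))
```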
(Almost) Generic XML Data Model
<Professor>
Gerhard Weikum
<Course>
IR
</Course>
Saarbrücken
<Research>
XML
</Research>
</Professor>
Element nodes:
1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"
2: docid=1, tag="Course", content="IR"
3: docid=1, tag="Research", content="XML"
Tags annotate content with corresponding concept
Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction
Figure labels: "Gerhard Weikum" is a person, "Saarbrücken" is a location
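As a rough sketch of this data model, the following Python turns the example document into element nodes that carry a document id, the tag, and the text directly contained in the element. The `Node` class and `to_nodes` function are hypothetical names chosen for illustration; the engine's actual storage layout is not shown on the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    node_id: int
    docid: int
    tag: str
    content: str           # text directly inside the element, without child text
    parent: Optional[int]  # node_id of the parent element, None for the root

def to_nodes(xml_text: str, docid: int) -> List[Node]:
    """Number elements in document order and collect their direct text."""
    nodes: List[Node] = []

    def visit(elem: ET.Element, parent: Optional[int]) -> None:
        # Direct text = text before the first child plus text after each child.
        pieces = [elem.text or ""] + [child.tail or "" for child in elem]
        content = " ".join(" ".join(pieces).split())
        node = Node(len(nodes) + 1, docid, elem.tag, content, parent)
        nodes.append(node)
        for child in elem:
            visit(child, node.node_id)

    visit(ET.fromstring(xml_text), None)
    return nodes

doc = """<Professor>
Gerhard Weikum
<Course>IR</Course>
Saarbrücken
<Research>XML</Research>
</Professor>"""
for node in to_nodes(doc, docid=1):
    print(node)
```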
Information Extraction (IE)
• Named Entity Recognition (NER)
• Named Entity ~ abstract datatype, concept
(location, person,…, IP-address)
• Mature (out-of-the-box products, e.g. GATE/ANNIE)
• Extensible
The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm.
is annotated as
The <company>Pelican Hotel</company> in <location>Salvador</location>, operated by <person>Roberto Cardoso</person>, offers comfortable rooms starting at <price>$100</price> a night, including breakfast. Please check in before <time>7pm</time>.
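To make the idea of an annotation module concrete, here is a deliberately tiny pattern-based annotator for two concepts, price and time. It is a toy stand-in written for illustration only: it does not use GATE/ANNIE, and the regular expressions and the `annotate` function are assumptions, far simpler than a real NER component (which would also cover persons, companies and locations).

```python
import re

# Hypothetical, very rough patterns for two concepts.
PATTERNS = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "time": re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b"),
}

def annotate(text: str) -> str:
    """Wrap every recognized entity in a tag named after its concept."""
    spans = sorted(
        (m.start(), m.end(), concept)
        for concept, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    )
    out, pos = [], 0
    for start, end, concept in spans:
        if start < pos:  # skip matches that overlap an earlier annotation
            continue
        out.append(text[pos:start])
        out.append(f"<{concept}>{text[start:end]}</{concept}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)

print(annotate("rooms starting at $100 a night. Please check in before 7pm."))
```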
Unifying Search on Heterogeneous Data
Sources: Web, Intranet, Databases, Enterprise Information Systems, …
→ Heuristics, type-specific transformations
→ XML
→ Annotation of named entities with IE tools (e.g., GATE)
→ Annotated XML
Annotation-Aware Data Model
<Professor>
Gerhard Weikum
<Course>IR</Course>
Saarbrücken
<Research>XML</Research>
</Professor>
Before annotation:
1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"
2: docid=1, tag="Course", content="IR"
3: docid=1, tag="Research", content="XML"
Annotation with GATE: "Saarbrücken" of type "location"
Annotation introduces new tags:
1: docid=1, tag="Professor", content="Gerhard Weikum"
2: docid=1, tag="Course", content="IR"
4: docid=1, tag="location", content="Saarbrücken"
3: docid=1, tag="Research", content="XML"
Data Model for Links
Architecture
[Figure: architecture diagram]
Sources: Web Portal, Flight Schedule, Tourist Guide (XML), SIGIR Website, Hotel Website, Graupmann Homepage, …
Adapters: Web Adapter, XML Adapter, EMail Adapter, …
Annotators (IE Processor): Annotation Modules for DATE, PRICE, LOCATION, …
Index
Search Engine
Example annotations: FROM=SIGIR, SUBJECT=Notification, Event=SIGIR, Location=Frankfurt, Location=Salvador, Date=15-18 August, Time=13:15, Price=89 $, Person=Schenkel
SphereSearch Queries
Extended keyword queries:
• similarity conditions
~professor, ~Saarbrücken
• concept-based conditions
person=Max Planck, location=Trondheim
• grouping
• join conditions
Ranked results with context-aware scoring
Score Aggregation: SphereScore
Local score sL(e) for each element e (tf/idf, BM25, …)
Weighted aggregation of local scores in environment of element (sphere score):
$s(e) = \sum_{d=0}^{D} \alpha^{d} \sum_{e' :\, dist(e,e') = d} s_L(e'), \quad 0 < \alpha \le 1$
Rewards proximity of terms and compactness of term distribution
Context awareness
[Figure: example element graph with occurrences of "research" and "XML" at distances 1 and 2, illustrating the sphere score s(1) of element 1]
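The sphere score sums, for every distance d from 0 to D, the local scores of all elements at distance d from e, damped by the d-th power of a factor alpha. A minimal Python sketch follows, assuming the element graph is given as an undirected adjacency dictionary; the function name and the values chosen for D and alpha are illustrative assumptions.

```python
from collections import deque

def sphere_score(graph, local, e, D=2, alpha=0.5):
    """Sum of alpha**d * local score over all elements within distance D of e."""
    dist = {e: 0}          # breadth-first distances from e
    queue = deque([e])
    total = 0.0
    while queue:
        node = queue.popleft()
        d = dist[node]
        total += alpha ** d * local.get(node, 0.0)
        if d == D:         # do not expand beyond the sphere radius
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = d + 1
                queue.append(neighbour)
    return total

# Hypothetical element graph: 1 - 2 - 4 and 1 - 3, with made-up local scores.
graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
local = {1: 1.0, 2: 0.5, 4: 2.0}
print(sphere_score(graph, local, 1, D=2, alpha=0.5))
```

With these numbers the score of element 1 is 1.0 + 0.5 * 0.5 + 0.25 * 2.0 = 1.75; restricting the sphere to D=1 drops the contribution of element 4.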
Similarity Conditions
Similarity conditions like ~professor, ~Saarbrücken
Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia
Relationships quantified by statistical co-occurrence measures
Disambiguation
Query expansion: δ-exp(x) = {w | sim(x,w) > δ}
Local score: weighted max over all expansion terms
$s_L(e, \sim professor) = \max_{t \in \delta\text{-exp}(professor)} \{ sim(professor,t) \cdot s_L(e,t) \}$
Abstraction awareness
[Figure: ontology graph around "professor" with related terms alchemist, primadonna, artist, director, wizard, investigator, intellectual, researcher, educator, scientist, scholar, academic/academician/faculty member, lecturer, mentor, teacher; example edge label: HYPONYM (0.7)]
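Query expansion with a similarity threshold and the weighted-max local score can be sketched as follows. The similarity table is made up for illustration, and treating the query term itself as an expansion term with similarity 1 is an assumption of this sketch.

```python
def expand(term, sim, delta):
    """delta-exp(x) = {w | sim(x, w) > delta}, plus x itself with similarity 1."""
    related = {w: s for w, s in sim.get(term, {}).items() if s > delta}
    related[term] = 1.0  # assumption: a term is fully similar to itself
    return related

def local_score_similar(element_scores, term, sim, delta):
    """Weighted max over all expansion terms of sim(term, t) * sL(e, t)."""
    return max(
        (s * element_scores.get(t, 0.0) for t, s in expand(term, sim, delta).items()),
        default=0.0,
    )

# Hypothetical similarities and per-term local scores of one element e.
sim = {"professor": {"lecturer": 0.7, "wizard": 0.2}}
element_scores = {"lecturer": 0.8, "wizard": 1.0}
print(local_score_similar(element_scores, "professor", sim, delta=0.5))
```

Here "wizard" falls below the threshold and is ignored, so the element scores 0.7 * 0.8 through its match on "lecturer".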
Concept-based conditions
Goal: Exploit explicit (tags) and automatic
annotations in documents
location=Trondheim (concept=value)
Element e: docid=1, tag="location", content="Trondheim"
sL(e,c=v) = score for concept-tag match + score for value-content match (concept-specific)
Allows similarity and range queries (for annotated concepts) with concept-specific distance measures, like
location~Trondheim
1970<date<1980
Concept awareness
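A small sketch of how such a condition could be scored: one component for the tag matching the concept, one for the content matching the value, plus a range check for a date concept. The 0/1 component scores, the exact-match comparison and the function names are assumptions for illustration; the slides do not specify the concrete scoring functions.

```python
def exact_match(value, content):
    """Hypothetical value-content score: 1 for a case-insensitive exact match."""
    return 1.0 if value.lower() == content.lower() else 0.0

def concept_score(element, concept, value, value_sim=exact_match):
    """sL(e, c=v) = score for concept-tag match + score for value-content match."""
    tag_score = 1.0 if element["tag"] == concept else 0.0
    return tag_score + value_sim(value, element["content"])

def year_in_range(element, low, high):
    """Range condition such as 1970<date<1980 on an element annotated as date."""
    return element["tag"] == "date" and low < int(element["content"]) < high

e = {"docid": 1, "tag": "location", "content": "Trondheim"}
print(concept_score(e, "location", "Trondheim"))
print(year_in_range({"docid": 1, "tag": "date", "content": "1975"}, 1970, 1980))
```

A concept-specific similarity (for example a geographic distance for locations) would be passed in through `value_sim` in place of the exact match.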
Query Groups
Goal: Related terms should occur in the
same context
Group conditions that relate to the same „entity“
professor teaching IR research XML
professor T(teaching IR) R(research XML)
SphereScore computed for each group
Find compact sets with one result for each group
Scores for Query Results
query result R: one result per query group
$score(R) = \beta \sum_{e_i \in R} s_i(e_i) + (1 - \beta)\, compactness(R)$
compactness ~ 1/size of a minimal spanning tree
Context awareness
[Figure: example graphs connecting nodes A and B through nodes X with weighted edges; candidate results with compactness C(N1) = 1/3, C(N2) = 1/4, C(N3) = 1/5]
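The result score combines the per-group scores with the compactness of the result, taken as the reciprocal of the size of a minimal spanning tree over the result elements. The sketch below assumes that pairwise distances between the result elements are already known and that the result is non-empty; the weight name `beta`, its value, and the convention that a single-element result has compactness 1 are assumptions.

```python
def mst_size(nodes, dist):
    """Total weight of a minimal spanning tree (Prim's algorithm)."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the tree built so far.
        weight, nxt = min(
            (dist[frozenset((u, v))], v)
            for u in in_tree for v in nodes if v not in in_tree
        )
        total += weight
        in_tree.add(nxt)
    return total

def result_score(local_scores, dist, beta=0.5):
    """beta * sum of group scores + (1 - beta) * compactness of the result."""
    size = mst_size(local_scores, dist)
    compactness = 1.0 / size if size > 0 else 1.0  # assumption for 1-element results
    return beta * sum(local_scores.values()) + (1 - beta) * compactness

# Hypothetical result with one element per query group and made-up distances.
scores = {"a": 0.5, "b": 0.4, "c": 0.6}
dist = {frozenset(("a", "b")): 1, frozenset(("b", "c")): 2, frozenset(("a", "c")): 3}
print(mst_size(scores, dist), result_score(scores, dist))
```

The spanning tree uses the edges of weight 1 and 2, so its size is 3 and the compactness is 1/3.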
Join conditions
Goal: Connect results of different query groups
A(research, XML)
B(VLDB 2005 paper)
A.person=B.person
[Figure: a result for group A (research, XML) with person "Ralf Schenkel" and a result for group B (VLDB 2005) with person "R.Schenkel" are connected by a join link; labels in the figure: 1.0, 0.9, 2004, 2005]
• Join conditions do not change the score for a node
• Join conditions create a new link with a specific weight
Dependent on database size, application:
• Precomputed
• Computed during query execution
Score for Join Conditions
Join condition A.T=B.S:
• For all nodes n1 with type T, n2 with type S,
add edge (n1,n2) with weight (sim(n1,n2))^-1, i.e. 1/sim(n1,n2)
• sim(n1,n2): content-based similarity
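Adding join edges can be sketched as below: every pair of nodes with the two requested types gets an undirected edge whose weight is the reciprocal of their content similarity, so more similar contents yield shorter edges. The similarity function (Python's `difflib.SequenceMatcher`) and the cut-off `min_sim` are stand-ins chosen for illustration, not the measure used by SphereSearch.

```python
from difflib import SequenceMatcher

def content_sim(a, b):
    """Hypothetical content-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def join_edges(nodes, type_t, type_s, sim=content_sim, min_sim=0.5):
    """Edges (n1, n2) with weight 1/sim(n1, n2) for n1 of type T, n2 of type S."""
    edges = {}
    for n1 in nodes:
        if n1["tag"] != type_t:
            continue
        for n2 in nodes:
            if n2 is n1 or n2["tag"] != type_s:
                continue
            key = (min(n1["id"], n2["id"]), max(n1["id"], n2["id"]))
            if key in edges:
                continue  # undirected edge already added
            s = sim(n1["content"], n2["content"])
            if s >= min_sim:  # assumption: ignore weakly similar pairs
                edges[key] = 1.0 / s
    return edges

nodes = [
    {"id": 1, "tag": "person", "content": "Ralf Schenkel"},
    {"id": 2, "tag": "person", "content": "R.Schenkel"},
    {"id": 3, "tag": "location", "content": "Trondheim"},
]
print(join_edges(nodes, "person", "person"))
```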
[Figure: result nodes of groups A and B connected through nodes X and a join edge, with weighted edges; compactness C(N4) of the joined result]
Setup for Experiments
No existing benchmark (INEX, TREC, …) fits
Three corpora:
• Wikipedia
• extended Wikipedia with links to IMDB
• extended DBLP corpus with links to homepages
50 queries like
• A(actor birthday 1970<date<1980) western
• G(California,governor) M(movie)
• A(Madonna,husband) B(director) A.person=B.director
Baseline: keyword queries with standard TF/IDF-based score, i.e. a "simplified Google"
Incremental Language Levels
SSE-Join (join conditions)
SSE-QG (query groups)
SSE-CV (concept-based conditions)
SSE-basic (keywords, SphereScores)
Experimental Results on Wikipedia
Experimental Results on Wiki++ and DBLP++
• SphereScores better than local scores
• New SSE features nearly double precision
Current and Future Work
• Improve graphical user interface
• Refined type-specific similarity measures
(like geographic distances) [SIGIR-WS 2005]
• Deep Web search through automatic portal
queries
• Parameter tuning with relevance feedback
• Efficiency of query evaluation through
precomputation and integrated top-k
(TopX talk this afternoon)
Thank you!