Databases & Information Retrieval
Maya Ramanath
(Further Reading:
Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G. Weikum, G. Kasneci, M. Ramanath and F.M. Suchanek, CACM, April 2009
DB & IR: Both Sides Now. G. Weikum, Keynote at SIGMOD 2007)
DB and IR: Different Motivations
• Both deal with large amounts of information,
but…
             | DB                           | IR
Applications | online reservation, banking  | libraries
Emphasis     | data consistency, efficiency | result quality, user satisfaction
Data         | structured records           | unstructured text
Queries      | precise                      | interpretations vary
Results      | exact match/all results      | ranked/top-k results
Why Combine Now?
• The applications drive the need
– The need to manage both structured and
unstructured data in an integrated manner
• Healthcare example
– Find young patients in central Europe who have been
reported, in the last two weeks, to have symptoms of
tropical virus diseases and an indication of
anomalies.
• Newspaper archives, product catalogues, etc.
Integrating DB & IR
[Figure: quadrant diagram of data type against query type]
Query axis: Unstructured queries / ranked results (keywords/top-k) vs. Structured queries / Boolean match results (SQL)
Data axis: Structured data (relational) vs. Unstructured data (text)
Corners: DB Systems, IR Systems
Topics placed between them:
• top-k processing, keyword search on graphs
• query processing for text search, effective query interfaces
• ranking for structured data
• extracting entities and relationships, ranking for entities
Modules
1. Top-k Processing
2. Query Processing and Interfaces
3. Keyword Search on Graphs
4. Entity and Relationship Extraction
5. Ranking and Structured Data
1. Top-k Processing (1/2)
• Structured data, with scores in multiple
dimensions
• Return the top-k “objects”
Color: Car, Score   | Mileage: Car, Score | Service: Car, Score
BMW X1, 0.9         | Honda City, 0.8     | Tata Nano, 0.7
Honda City, 0.8     | Maruti Swift, 0.6   | Maruti Swift, 0.6
Maruti Swift, 0.6   | Tata Nano, 0.3      | Honda City, 0.3
Tata Nano, 0.1      | BMW X1, 0.1         | BMW X1, 0.1
$$\mathrm{Score}(O) = \sum_{i \in \{\text{color},\, \text{mileage},\, \text{service}\}} S_i(O)$$
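One well-known way to find the top-k objects without reading every list to the end is Fagin's threshold algorithm. The slide does not name an algorithm, so treat the sketch below as an illustrative assumption. It reads the three sorted lists row by row, looks up each newly seen car's full score, and stops once the k-th best score is at least the sum of the scores in the current row. The function name `threshold_top_k` and the dictionary layout are hypothetical.

```python
import heapq

# Sorted lists as read from the slide: (car, score) in descending score order.
LISTS = {
    "color": [("BMW X1", 0.9), ("Honda City", 0.8), ("Maruti Swift", 0.6), ("Tata Nano", 0.1)],
    "mileage": [("Honda City", 0.8), ("Maruti Swift", 0.6), ("Tata Nano", 0.3), ("BMW X1", 0.1)],
    "service": [("Tata Nano", 0.7), ("Maruti Swift", 0.6), ("Honda City", 0.3), ("BMW X1", 0.1)],
}


def threshold_top_k(lists, k):
    """Return (top-k list of (object, score), number of rows read)."""
    # Random-access index: dimension -> {object: score}.
    lookup = {dim: dict(entries) for dim, entries in lists.items()}
    seen = {}  # object -> aggregated score (sum over all dimensions)
    depth = max(len(entries) for entries in lists.values())
    for row in range(depth):
        threshold = 0.0  # best score any still-unseen object could reach
        for dim, entries in lists.items():
            if row >= len(entries):
                continue
            obj, score = entries[row]
            threshold += score
            if obj not in seen:
                seen[obj] = sum(lookup[d].get(obj, 0.0) for d in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, row + 1
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth


best, rows_read = threshold_top_k(LISTS, k=1)
print(best[0][0], rows_read)
```

On the slide's data, Honda City comes out on top with a total of about 1.9, and the loop stops after three of the four rows.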
1. Top-k Processing (2/2)
• Top-k Joins
– Example: Return the best house-school pair
Houses | Rating | Location || Schools | Rating | Location
H1     | 0.9    | L1       || S1      | 0.4    | L2
H2     | 0.8    | L2       || S2      | 0.2    | L2
H3     | 0.6    | L3       || S3      | 0.8    | L3
H4     | 0.1    | L3       || S4      | 0.1    | L3
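A minimal sketch of the house-school example follows. The slide states neither the join condition nor the scoring function, so the code assumes that a pair is valid when house and school share a location and that a pair's score is the sum of the two ratings. It builds all joined pairs and keeps the best k. A real rank-join operator would avoid materializing the full join, which this sketch does not attempt. The name `top_k_pairs` is hypothetical.

```python
import heapq
from collections import defaultdict

# (name, rating, location), as read from the slide.
HOUSES = [("H1", 0.9, "L1"), ("H2", 0.8, "L2"), ("H3", 0.6, "L3"), ("H4", 0.1, "L3")]
SCHOOLS = [("S1", 0.4, "L2"), ("S2", 0.2, "L2"), ("S3", 0.8, "L3"), ("S4", 0.1, "L3")]


def top_k_pairs(houses, schools, k):
    """Join on location (assumed) and rank pairs by rating sum (assumed)."""
    schools_by_location = defaultdict(list)
    for name, rating, location in schools:
        schools_by_location[location].append((name, rating))
    pairs = (
        (house, school, house_rating + school_rating)
        for house, house_rating, location in houses
        for school, school_rating in schools_by_location.get(location, [])
    )
    return heapq.nlargest(k, pairs, key=lambda pair: pair[2])


for house, school, score in top_k_pairs(HOUSES, SCHOOLS, k=2):
    print(house, school, round(score, 2))
```

Under these assumptions the best pair is (H3, S3). H1 has the highest house rating but no school at its location, so it appears in no pair.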
2. Query Processing and Interfaces (1/3)
• Given: Database of text documents and a text-centric task.
– Extract information about disease outbreaks
• Strategies
– Scan all documents – very expensive
– Filter promising documents – affects recall
• Develop cost models and execution strategies
appropriate for this setting
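The trade-off between the two strategies can be illustrated with a toy corpus. Everything here is a hypothetical stand-in: the four documents, the `extract` function (a placeholder for an expensive extractor), and the "outbreak" keyword filter. Cost is simply the number of documents passed to the extractor, and recall is measured against what a full scan finds.

```python
DOCS = [
    "Cholera outbreak reported in the coastal district.",
    "The city council approved a new budget.",
    "Health officials confirmed several new dengue cases this week.",
    "Local team wins the football championship.",
]


def extract(doc):
    """Placeholder for an expensive extractor: returns a disease name or None."""
    for disease in ("cholera", "dengue"):
        if disease in doc.lower():
            return disease
    return None


def run(docs, doc_filter=None):
    """Run the extractor on all docs, or only on those passing doc_filter."""
    candidates = [d for d in docs if doc_filter is None or doc_filter(d)]
    found = {fact for fact in map(extract, candidates) if fact}
    return found, len(candidates)


scan_found, scan_cost = run(DOCS)
filter_found, filter_cost = run(DOCS, lambda d: "outbreak" in d.lower())
recall = len(filter_found) / len(scan_found)
print(scan_cost, filter_cost, recall)
```

Filtering cuts the extractor calls from four to one, but misses the dengue report, which never uses the word "outbreak". A cost model would weigh this saving against the lost recall.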
2. Query Processing and Interfaces (2/3)
Querying with “typed” keywords
• Keyword querying: Easy to use
• Structured queries: Precise
Find the middle ground…
Instead of the keyword query
“german has won nobel award”
or the structured query
q(x) :- GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y)
use typed keywords:
“german, has won (nobel award)”
2. Query Processing and Interfaces (3/3)
• Does the output have to be a boring list of ranked results?
• Nope!
Figure 1: The faceted retrieval interface of Facetedpedia. (Source: WWW 2010 full paper, April 26-30, Raleigh, NC, USA.)
3. Keyword Search on Graphs (1/3)
• Lots of graphs around
– Relational DB (tuples+foreign keys)
– XML data (elements/sub-elements/id/idrefs)
– RDF (graph-structured knowledge-bases)
• Easy to query with keywords, instead of
SQL/XQuery/SPARQL
• Results are the top-k interconnections between
the keywords
3. Keyword Search on Graphs (2/3)
3. Keyword Search on Graphs (3/3)
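Query: “Einstein”, “Bohr”
[Figure: knowledge graph with the following edges]
Einstein –isa→ vegetarian
Tom Cruise –isa→ vegetarian
Einstein –won→ Nobel Prize
Bohr –won→ Nobel Prize
Tom Cruise –bornIn→ 1962
Bohr –diedIn→ 1962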
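The idea of returning the top-k interconnections between two keywords can be sketched on the example graph. The edge list below is read off the slide's figure and is an assumption about how its nodes connect. Ranking connections by path length is also an assumption, since the slides do not give a scoring function. The code treats edges as undirected, enumerates all simple paths between the two keyword nodes, and returns the k shortest.

```python
from collections import defaultdict

# (subject, relation, object) edges as read from the slide's figure (assumed).
EDGES = [
    ("Einstein", "isa", "vegetarian"),
    ("Tom Cruise", "isa", "vegetarian"),
    ("Einstein", "won", "Nobel Prize"),
    ("Bohr", "won", "Nobel Prize"),
    ("Tom Cruise", "bornIn", "1962"),
    ("Bohr", "diedIn", "1962"),
]


def top_k_connections(edges, source, target, k):
    """Return the k shortest simple paths between source and target."""
    adjacency = defaultdict(set)
    for subject, _relation, obj in edges:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)  # ignore edge direction

    paths = []

    def extend(path):
        node = path[-1]
        if node == target:
            paths.append(list(path))
            return
        for neighbor in sorted(adjacency[node]):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                path.append(neighbor)
                extend(path)
                path.pop()

    extend([source])
    paths.sort(key=len)
    return paths[:k]


for path in top_k_connections(EDGES, "Einstein", "Bohr", k=2):
    print(" - ".join(path))
```

With this ranking, the connection through Nobel Prize comes before the longer one through vegetarian, Tom Cruise, and 1962. Exhaustive path enumeration only works for tiny graphs.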
4. Entity and Relationship Extraction (1/2)
Information Extraction (or Knowledge Harvesting)
Source text:
• “Bill Gates was the founder of Microsoft and later its CEO.”
• “Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.”
• “Infosys was founded on 2 July 1981 by seven entrepreneurs: N. R. Narayana Murthy, Nandan Nilekani, …”
Extracted facts:
Company   | Founder
Microsoft | Bill Gates
Apple     | Steve Jobs
Apple     | Steve Wozniak
Infosys   | N. R. Narayana Murthy
4. Entity and Relationship Extraction (2/2)
• How to build a knowledge-base of facts?
– Structurize Wikipedia
– Construct rules for extraction
• How do I acquire all the facts in the world?
– Extract “everything”
– Don’t stop extracting
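Constructing rules for extraction can be sketched with two hand-written patterns applied to two of the example sentences about company founders. The regular expressions and the function name `extract_founders` are hypothetical. They cover only the phrasings "X was the founder of Y" and "Y was established/founded on … by A, B, and C", and they would break on names containing periods, such as "N. R. Narayana Murthy".

```python
import re

TEXT = (
    "Bill Gates was the founder of Microsoft and later its CEO. "
    "Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne."
)

# One or more capitalized words, e.g. "Bill Gates".
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"
FOUNDER_OF = re.compile(rf"(?P<founder>{NAME}) was the founder of (?P<company>[A-Z]\w*)")
FOUNDED_BY = re.compile(
    r"(?P<company>[A-Z]\w*) was (?:established|founded) on [^.]*? by (?P<founders>[^.]+)\."
)


def extract_founders(text):
    """Return (company, founder) pairs matched by the two rules above."""
    facts = []
    for match in FOUNDER_OF.finditer(text):
        facts.append((match["company"], match["founder"]))
    for match in FOUNDED_BY.finditer(text):
        # Split "A, B, and C" into individual names.
        for person in re.split(r",\s*(?:and\s+)?|\s+and\s+", match["founders"]):
            facts.append((match["company"], person.strip()))
    return facts


for company, founder in extract_founders(TEXT):
    print(company, "|", founder)
```

Rules like these are precise but brittle. Each new phrasing needs a new pattern, which is part of why acquiring "all the facts" is hard.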
5. Ranking and Structured Data
• Not the same as top-k processing
• Given: Data with structure in it
– Relational tables (flat)
– XML (trees/graphs)
– Text documents consisting of entities
• Task: Rank the query results
– SQL/XQuery/“typed” keywords
QUESTIONS?