CS 4440
Lecture 1: Tracking Changes on the Web
Fresh Information Delivery
with Continual Queries
Ling Liu
Distributed Data Intensive Systems Lab
College of Computing
Georgia Institute of Technology
© Ling Liu
Outline
‹ Motivation
‹ Part I: Continual Query Concept and Continual Query Project
‹ Part II: CQ-Related Research Issues
‹ Our Approach and Initial Evaluation
‹ WebCQ Demo
‹ XWRAP Toolkits
‹ Ideas for Course Projects
Motivation
‹ Everyone today can publish information on the web independently and at any time
‹ Information sources are constantly and rapidly changing
‹ These rapid and often unpredictable changes create a new problem:
● detecting, representing, and notifying users of changes
Personalized Information Monitoring
Applications and Motivation
User Requirement: Monitor the weather in the region of the port of Savannah and Atlanta over the next 3 months. Alert when the weather condition is bad (heavy wind or heavy rain).
[Figure: Information Monitoring Service. Sources: Transportation Route Information, Weather Information, Truck Avail. Information. Outputs: Pushed Updates or Trigger re-planning.]
Information Monitoring at Internet Scale: What are the main challenges?
‹ Information sources are heterogeneous and autonomous (non-cooperating)
‹ There is a higher latency between source changes and change notification
‹ Needs techniques for efficient execution of long-running distributed queries and long-standing distributed triggers
Personalized Info Monitoring
‹ State of the art
● Manual process
■ locating and detecting information updates manually
■ High latency, uncontrollable
● Application-specific programmed polling
■ Scalability difficulties
■ Extensibility and generality difficulties
Continual Query Project
‹ Goal: Internet-scale Solution
● delivering right info to right users at right time
● scalability, extensibility, and responsiveness
‹ Methods and Key Techniques:
■ Continual query concept
■ An extensible three-tier architecture
■ Mechanisms for efficient and scalable implementation of CQs
‹ Two Running Systems
● OpenCQ: Monitoring changes in structured or semi-structured data sources
● WebCQ: Monitoring changes in arbitrary web pages
CQ Project Overview
Monitoring, Filtering, Notification by CQ
[Figure: labels include Internet, Wireless, Sensors, CQ Engines, Distr. Trig., Event Obs., Filter, Query Results, Significant Changes.]
Continual Query Concept
‹ Continual Query: {Q, Trigger, STOP}
‹ Continual Semantics
● CQ issued once and run until STOP
● When Trigger becomes true, Q is evaluated
● New results of Q (since the previous execution) will be returned
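The triple {Q, Trigger, STOP} can be sketched as a small data structure. This is only an illustration, not the OpenCQ or WebCQ implementation: the class name `ContinualQuery`, its fields, and the `step` method are hypothetical, and Q is assumed to return a set of strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class ContinualQuery:
    """Hypothetical container for the triple {Q, Trigger, STOP}."""
    query: Callable[[], Set[str]]   # Q: returns the current result set
    trigger: Callable[[], bool]     # Trigger: when Q should be evaluated
    stop: Callable[[], bool]        # STOP: termination condition
    seen: Set[str] = field(default_factory=set)  # results already delivered

    def step(self) -> Set[str]:
        """One monitoring step: return only results not delivered before."""
        if self.stop() or not self.trigger():
            return set()
        new = self.query() - self.seen
        self.seen |= new
        return new


# Usage: the source gains one route between two evaluations.
source = {"route-A"}
cq = ContinualQuery(query=lambda: set(source),
                    trigger=lambda: True,
                    stop=lambda: False)
first = cq.step()
source.add("route-B")
second = cq.step()
```

The second call returns only the route added since the previous execution, which matches the "new results of Q" bullet above.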
CQ Concept: An Example
Request: Monitor the weather in the region of the port of Savannah and Atlanta. Find all alternate transport routes when the weather condition is bad (heavy wind or heavy rain).
Continual Query: {Q, Trigger, STOP}
Q: Alternative transport routes
Trigger: heavy wind or rain
STOP: three months
[Figure: Install CQ on the CQ Server; sources are Transportation Route Information, Weather Information, and Truck Information; the server returns CQ updates.]
CQ Triggers
‹ Time-based Triggers:
● based on time events
■ every day at 10am, on the first day of each month
● Implementation feature: User-specified polling interval
‹ Content-based Triggers:
● based on content update events
■ whenever the snow coverage at Rochers De Naye reaches 10 feet send me the up-to-date train schedule
● Implementation feature: System-controlled polling interval
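The two trigger classes can be contrasted with a minimal sketch. The function names and the way the snow reading is passed in are assumptions made for illustration; a real system would obtain the reading by polling the source.

```python
from datetime import datetime, timedelta


def time_trigger(last_run: datetime, now: datetime, interval: timedelta) -> bool:
    """Time-based: fire when the user-specified polling interval has elapsed."""
    return now - last_run >= interval


def content_trigger(snow_feet: float, threshold_feet: float = 10.0) -> bool:
    """Content-based: fire when the polled value reaches the threshold."""
    return snow_feet >= threshold_feet


# Usage: a daily trigger and the snow-coverage condition from the slide.
t0 = datetime(2002, 8, 1, 10, 0)
fires_next_day = time_trigger(t0, t0 + timedelta(days=1), timedelta(days=1))
fires_too_early = time_trigger(t0, t0 + timedelta(hours=5), timedelta(days=1))
snow_fires = content_trigger(10.5)
```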
Example Continual Queries
‹ Report to me the total amount and classification of materials coming into or going out from these ports every 5 hours.
‹ Notify me in the next six months whenever the inventory level of 120 mm ammunition drops by 5%.
‹ Notify me whenever an airplane has been in sector A for more than 5 minutes.
‹ Report to me every day at 10:00 am if the demand for any stocked item published on this web site is higher than the planned inventory.
Part 2: Research Issues
‹ General research issues for Querying and Search on the Web
‹ CQ-related research issues
Research Issues in WebDB
‹ architecture for interoperable and scalable global information systems
■ Client and Server or Peer-to-Peer
■ mediator-wrapper or multi-agent architecture
‹ distributed query routing
‹ distributed catalog management
‹ distributed multi-layered indexing techniques
‹ distributed query optimization
‹ distributed query result assembly
…
Incorporating Runtime Info is critical in addressing all these issues
CQ-related Research Issues
● Distributed event-driven architecture
■ five models: objects, events, observation, notification, resources [Liu+TKDE97]
■ Client-Server or P2P
● Performance
■ efficient execution of CQs
● Coverage
■ types of changes that can be captured
● Reliability
■ CQ System Recovery (server end) and Application Recovery (client end)
● Scalability
■ number of data sources (e.g., 1000) and number of users (e.g., 10,000)
Research Problems
‹ Efficient Execution of Continual Queries
● Change Detection Problems
■ No explicit support for synchronous triggers
■ Update events (operations) occur autonomously
■ No built-in triggers at the data source sites (few data producers publish the trigger facility or the native data updates)
● Differential evaluation of Continual Queries
■ Naïve vs. DRA algorithms
‹ Scalable Distributed Trigger Processing
● Tens or hundreds of thousands of triggers firing at thousands of data sources.
Efficient Execution of CQs
‹ A model for efficient execution of CQs
● brute-force (naive) algorithm
● differential re-evaluation algorithm (DRA)
‹ A model for efficient detection of simple and composite events of changes
● Primitive and Composite Event Handling (specification, detection, notification)
Continual Semantics Revisited
‹ Continual Query: {Q, Trigger, STOP}
‹ Continual Semantics: the result of a continual query is the set of data that would be returned if the query were executed at every instant in time.
Qcq(t) = ∪_{x ≤ t} Q(x)
■ Qcq(t): the total set of data returned up to time t by executing Q as a continual query
■ Q(t): the result of running Q at time t.
When a query Q is executed with continual semantics, it returns Qcq(t), not Q(t).
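A small worked example may help separate Qcq(t) from Q(t). The sketch below assumes discrete time points and a made-up `snapshots` mapping from time x to Q(x); both are illustrative assumptions, not part of the lecture.

```python
def q_cq(snapshots, t):
    """Union of Q(x) over all recorded times x <= t."""
    result = set()
    for x, q_x in snapshots.items():
        if x <= t:
            result |= q_x
    return result


# Q(1) = {a}, Q(2) = {a, b}, Q(3) = {b, c}
snapshots = {1: {"a"}, 2: {"a", "b"}, 3: {"b", "c"}}
total_at_3 = q_cq(snapshots, 3)   # everything returned up to time 3
plain_at_3 = snapshots[3]         # what a one-shot query at time 3 returns
```

Here `total_at_3` still contains "a" even though "a" is no longer in Q(3), which is the difference between continual and one-shot semantics.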
Efficient Execution of CQs
‹ A model for efficient execution of CQs
● brute-force (naive) algorithm
● differential re-evaluation algorithm (DRA)
CQ Execution: Naive Algorithm
set t := −∞, set Q(t) := ∅
while stop <> true do
    set tprev := t, set t := current time
    if Q(tprev) = ∅ then        // first run
        execute query Qcq(t)
        display Qcq(t)
    else if trig = false then
        sleep
    else
        execute queries Qcq(t) and Qcq(tprev)
        return Qcq(t) − Qcq(tprev)
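One iteration of the naive loop above can be sketched as follows. The helper `naive_step` and the toy query over in-memory sets `R` and `S` are assumptions for illustration; the point is only that the full query is re-run every time and the answer is obtained by set difference.

```python
def naive_step(run_query, previous_result):
    """Re-run the full query and diff against the previous full result."""
    current = run_query()
    return current - previous_result, current


# Toy data source: the query is the cross product of two in-memory sets.
R = {1, 2}
S = {"x", "y"}
run_q = lambda: {(r, s) for r in R for s in S}

first_delta, prev = naive_step(run_q, set())   # first run returns everything
R.add(3)                                       # one update at the source
delta, prev = naive_step(run_q, prev)          # full re-evaluation, then diff
```

The cost of each step is that of evaluating the whole query, regardless of how small the update was.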
Incremental Query Evaluation
‹ Problem with the Naive Approach
● when answer(Q) involves a large collection of data sources and the update between tprev and t is relatively small, the naive approach is inefficient
● Example:
Q := R × S and there is one update during the period (tprev, t): insert(e, R).
■ Computing {e} × S is much cheaper than re-evaluating Q, especially when S is relatively smaller than R.
■ We call {e} × S an incremental query, denoted as ∆Q(tprev, t)
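The R × S example can be checked directly. In this sketch the function name `delta_cross_insert` and the concrete contents of R and S are made up; it covers only insertions into R, as in the slide's example.

```python
def delta_cross_insert(inserted_rows, S):
    """Incremental query for Q := R x S when rows are inserted into R: {e} x S."""
    return {(e, s) for e in inserted_rows for s in S}


R = set(range(1000))
S = {"x", "y"}
e = 1000

full_before = {(r, s) for r in R for s in S}          # 2000 pairs
delta = delta_cross_insert({e}, S)                    # 2 pairs
full_after = {(r, s) for r in R | {e} for s in S}     # 2002 pairs
```

The incremental query touches |S| pairs, while re-evaluating Q touches (|R| + 1) × |S| pairs to produce the same difference.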
CQ Execution: DRA Algorithm
set t := −∞, set Q(t) := ∅
while stop <> true do
    set tprev := t, set t := current time
    if Q(tprev) = ∅ then        // first run
        execute query Qcq(t)
        display Qcq(t)
    else if trig = false then
        sleep
    else
        execute query ∆Q(tprev, t)
        return Qcq(t) − Qcq(tprev), the diff result, to user
[Liu+ICDCS96]
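A minimal sketch of one DRA iteration, under strong simplifying assumptions: the query is Q := R × S, only insertions into R occur, and the updates since tprev are available as a list. The names `dra_step`, `update_log`, and `cached_result` are hypothetical and do not come from [Liu+ICDCS96].

```python
def dra_step(update_log, S, cached_result):
    """Evaluate the incremental query over logged inserts instead of re-running Q."""
    delta = {(e, s) for e in update_log for s in S} - cached_result
    return delta, cached_result | delta


S = {"x", "y"}
cache = {(1, "x"), (1, "y")}                 # result of the first full run, R = {1}
delta, cache = dra_step([2, 3], S, cache)    # rows inserted into R since tprev
```

Unlike the naive loop, the step never reads the unchanged part of R; it needs the update log and the cached previous result instead.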
Research Challenges
‹ Brute force vs. DRA:
■ Algorithms for effectively transforming arbitrary CQs into delta CQs [ICDCS96]
■ When is the DRA beneficial and for which types of data sources and what types of CQs?
■ Algorithms for efficient caching of CQ previous execution results
■ Techniques for efficient/scalable trigger condition evaluation
‹ Tcq1 = (E1, E2, E3)
‹ Tcq2 = (E1 | E2 | E3)
‹ Tcq3 = (E1, E5) ...
Efficient Event Detection
‹ Two classes of event detection methods
● Synchronous approach
■ an event occurrence is communicated explicitly to and in synchronization with the event observer
■ Typical example: DB built-in triggers
● Asynchronous (Polling) approach
■ the server periodically checks for the occurrence of an event
■ All third-party monitoring services are of this type
Change Detection: Polling Approach
‹ Problem Statement:
Given two snapshots generated at two different polling time points, find the difference between these two snapshots, i.e., compare the two snapshots and discover
● what has been inserted?
● what has been modified?
● what has been removed?
● etc.
‹ Difference Algorithms
■ GNU diff utility [HHS+-MIT]
■ ediff program [Kifer95] (Emacs), etc.
■ LaDiff program [CRHW96-stanford]
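The snapshot comparison can be illustrated with Python's standard `difflib` module, used here as a stand-in for the diff tools listed above rather than as any of them. The function `snapshot_changes` and the two sample snapshots (with invented values) are assumptions for illustration.

```python
import difflib


def snapshot_changes(old_lines, new_lines):
    """Classify line-level differences between two polled snapshots."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = {"inserted": [], "removed": [], "modified": []}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            changes["inserted"] += new_lines[j1:j2]
        elif tag == "delete":
            changes["removed"] += old_lines[i1:i2]
        elif tag == "replace":
            changes["modified"].append((old_lines[i1:i2], new_lines[j1:j2]))
    return changes


# Two snapshots of a made-up quote page, one line per item.
old = ["IBM 70.10", "SUNW 3.90", "ORCL 9.80"]
new = ["IBM 71.25", "SUNW 3.90", "ORCL 9.80", "MSFT 47.00"]
changes = snapshot_changes(old, new)
```

A line-based diff like this ignores document structure; the tree-based model on the following slides addresses that.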
Polling Approach: Basic Concepts
‹ Representing each snapshot using a generic data structure
■ such as using an ordered tree
■ good for documents, HTML pages, LaTeX files
‹ Define a set of change operations for capturing update types
‹ Define an Edit Script ε = a sequence of update operations
■ Primitive change operations include INS(node), DEL(node), UPD(node), COPY(node), MOVE(node), etc.
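An edit script over an ordered tree can be sketched as a list of operations applied in sequence. The encoding below is an assumption made for illustration: nodes are addressed by a path of child indices, and only INS, DEL, and UPD are implemented (COPY and MOVE are omitted).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def node_at(root, path):
    """Follow a list of child indices from the root."""
    for i in path:
        root = root.children[i]
    return root


def apply_script(root, script):
    """Apply (op, path, arg) operations in order; mutates and returns the tree."""
    for op, path, arg in script:
        if op == "UPD":
            node_at(root, path).label = arg
        elif op == "INS":
            node_at(root, path[:-1]).children.insert(path[-1], Node(arg))
        elif op == "DEL":
            del node_at(root, path[:-1]).children[path[-1]]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return root


def labels(node):
    """Nested (label, children) view, convenient for comparing trees."""
    return (node.label, [labels(c) for c in node.children])


t1 = Node("html", [Node("h1"), Node("p")])
script = [("UPD", [0], "h2"), ("INS", [2], "ul"), ("DEL", [1], None)]
t2 = apply_script(t1, script)
```

With a unit cost per operation, the cost of this script would simply be its length.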
Polling Approach: Formal Model
‹ Problem Statement:
Given two rooted, labeled trees T1 and T2, find the edit script ε of the lowest cost such that
‹ ε transforms T1 to a tree that is isomorphic to T2, and
‹ for any edit script ε′ over (T1, T2), the following property holds: Cost(ε) ≤ Cost(ε′).
Polling Approach: Optimization
‹ Known Problem: A typical difference algorithm over two trees of n nodes runs in time O(n² log² n) for balanced trees and even higher for unbalanced trees [ZhangShasha89]
‹ There are several ways to improve the diff performance:
● to utilize a domain-specific language to capture the portions of the tree that users are interested in monitoring for information changes
■ by coloring those portions of the tree that need to be continually watched for updates
■ thus reducing the problem to a diff algorithm between two colored trees with m < n colored nodes.
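The coloring idea can be sketched as a pruning step that runs before any tree diff. The tuple encoding `(label, children)`, the function `colored_view`, and the sample page are assumptions for illustration; the diff itself is not shown.

```python
def colored_view(tree, colored):
    """Keep only the subtrees whose root label is marked as colored."""
    label, children = tree
    if label in colored:
        return [tree]
    kept = []
    for child in children:
        kept += colored_view(child, colored)
    return kept


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in tree[1])


page = ("html", [("nav", [("a", []), ("a", [])]),
                 ("div#price", [("span", [])]),
                 ("footer", [("p", [])])])
watched = colored_view(page, {"div#price"})

n = size(page)                      # nodes in the full tree
m = sum(size(t) for t in watched)   # nodes that actually need diffing
```

Only the `watched` subtrees from the two snapshots would then be handed to the difference algorithm.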
Research Problems
‹ Efficient Execution of Continual Queries
● Change Detection Problems
● Differential evaluation of Continual Queries
‹ Scalable Distributed Trigger Processing
● Hundreds of thousands of triggers firing at thousands of data sources.
Scalability
‹ What architecture will allow the CQ system
● to efficiently organize and partition its change detection task,
● to handle notification to multiple applications or users interested in the same event(s)
● to characterize events involving multiple and possibly heterogeneous components (data sources)
● ultimately, to provide robust and painless support for more than 10,000 users over more than 1000 data sources
Initial Results
‹ Using subscription grouping/indexing techniques
● to group CQs that have the same or similar trigger structure (trigger pattern) into one group,
● to create a polling query for each group of CQs,
● using main-memory and disk-based organization for CQ indexes.
● Support grouping at trigger level, query level, notification level and data level
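Trigger-level grouping can be illustrated with a dictionary keyed by trigger pattern. This is a sketch under simplifying assumptions: a trigger pattern is reduced to a (source, pattern) pair, grouping requires an exact match rather than similarity, and the CQ identifiers are made up.

```python
from collections import defaultdict


def group_by_trigger(cqs):
    """Group CQ ids that share the same (source, trigger pattern) key."""
    groups = defaultdict(list)
    for cq_id, source, pattern in cqs:
        groups[(source, pattern)].append(cq_id)
    return dict(groups)


cqs = [("cq1", "stocks", "price_drop"),
       ("cq2", "stocks", "price_drop"),
       ("cq3", "weather", "heavy_rain"),
       ("cq4", "stocks", "price_drop")]
groups = group_by_trigger(cqs)

polls_without_grouping = len(cqs)   # one polling query per CQ
polls_with_grouping = len(groups)   # one polling query per group
```

Each group is polled once and the outcome is shared by all member CQs, which is where the saving in trigger evaluation comes from.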
Benefits of CQ Grouping
[Figure: Trigger Evaluation Time (Sec), axis from 0 to 3600, for Ncq=8,000 and Ncq=10,000; series: #ofGroups=100, #ofGroups=1000, #ofGroups=2000, No Grouping.]
WebCQ Architecture
WebCQ Live Demo
http://disl.cc.gatech.edu/WebCQ
Demo Walkthrough
WebCQ for Mobile Clients
[Figure: components are Mobile Client, Mobile Adaptor, Profile DB, WebCQ Server, Metadata Repository. Flow labels: Request for registration, installation of sentinel or updates; Request forwarding; Monitoring updates; Content adapted update for clients.]
WebCQ for Wireless Palm
Scenario 1: Palm Query Application
Scenario 2: Java MIDlet Client
[Screenshots: Entrance, Sentinel Installation, WebCQ Notification]
WebCQ for cell phones
[Screenshots: Entrance; Client Login; Sentinel Installation; Choose category; Stock (IBM); Choose target source URL; Get updates]
Example WebCQ Applications
‹ News, stock ticker
‹ Traffic monitoring
‹ Web site aggregation
‹ Interest Recommendation
…
Captured @ 3pm, 8/1/2002
Ongoing Research
[Figure: Sensors; Infotaps & Fat Clients; Cluster of Servers; Heterogeneous Information Sources]
P2P: Big Technical Challenges
‹ Ability to efficiently distribute and partition services among (active) peers
● Dynamic load balancing problems
‹ Ability to efficiently find files and services from a potentially huge number of peers
● data placement problems
[Gribble, Halevy, Ives, Rodrig, Suciu 01]
[Clark 01]
Available Services & Tools
‹ The OpenCQ system
http://www.cc.gatech.edu/projects/disl/CQ/
● The NT version is downloadable from http://www.cc.gatech.edu/projects/disl/CQ/plugin/
‹ The WebCQ system
http://www.cc.gatech.edu/projects/disl/WebCQ/
Open Source download
Application Service Development
Mediator-Wrapper Technology (XWRAP toolkits)
Motivation: Why Wrapper Technology
‹ Web: vast number of information sources
‹ Search Engines: first and last resort
‹ The Next Big Challenge - Interoperability
● Data Extraction and Data Interpretation
● Data Interoperation among applications
● Common approach: Using Wrappers
● Key challenges: Scalability and Evolution
Why Wrappers are useful
‹ Wrappers hide the heterogeneity and enhance scalability in information integration systems
[Figure: Mediators layered over Wrappers]
Junglee and Jango: initial success in industry
What is a Wrapper?
‹ A wrapper is a software program designed for
● extracting and mapping the source information content into a more structured format (data wrapping);
● performing content filtering to answer content-sensitive queries over an individual web site (function wrapping).
[Figure: Software Wrapper around an individual Web site. Input: NL query or XML-QL-like query. Query transformation: HTTP query, keyword search. Source output: HTML/text documents. Wrapper output: structured data object(s).]
Design Choice of a Wrapper
‹ Lightweight wrapper
● simple transformation of a mediator request to an executable method call to the remote web site
● Example: Oracle Wrapper/gateway
‹ Heavyweight wrapper
● this type of wrapper is needed when the data manager at the remote data source site has less capability
● Example: Enhanced search tool for Amazon.com
Using Wrappers: Example Applications
‹ Class 1 Applications
● offer an integrated search service over heterogeneous information providers
● Example:
■ shopping comparison agent
■ Metacrawler
‹ Class 2 Applications
● offer advanced aggregation/summarization service over a heterogeneous collection of web-based information providers
● Example:
■ supply chain management
■ Aggregation Portal Service
Wrapper Construction
‹ A main challenge in wrapper construction
● discover boundaries of meaningful objects in a web document or a collection of web documents
● distinguish the information content from their metadata description
● Recognize and encode the metadata explicitly
[Figure: HTML source document plus Wrapper Developer's Information Extraction Knowledge, mapped to OO representation, Relational representation, or XML representation.]
Example
[Figure: Application ("Search books by author"), Mediator, Wrappers, Web Pages. Labels: SQL-like query; url; structural format, such as XML or a relational table, e.g. <book name=“After the Quake”>…</book>.]
XWRAP Family
‹ XWRAP Original
● One of the first semi-structured Java wrapper generation systems with interactive GUI
● Generate Java wrappers in a couple of hours compared to days and weeks by hand
● Published in SIGMOD 1999 (short paper), ICDE 2000, IJIS 2001
‹ XWRAP Elite
● The first Web-based Wrapper Code Generator with automated information extraction capability
● Allow anyone to generate Java code on the fly in minutes
● 500+ users, 2000+ wrappers generated
● Published in SIGMOD 2000 (short paper), SIGMOD Record 2001, Used in OpenCQ system reported in IJCS 2001
XWRAP Family (cont.)
‹ XWRAP Composer
● The first composable wrapper application generation system that supports multi-page information extraction
● Novel design framework
■ composer interface/outerface description
■ composer scripting language
‹ Specifying query-answer control logics
‹ Specifying information extraction logics
● Used for DoE SciDAC effort for Scientific workflow Process Applications
● Tested on five different Bioinformatic data sources
■ NCBI, GenBank, Clusfavor, PDB, Transfac
XWRAP Elite Approach
<book>
  <booklink>http://…</booklink>
  <title>After the Quake</title>
  <shipping>In Stock:Ships with 24</shipping>
  <author>Haruki Murakami, Jay Rubin(Translator)</author>
  <format>Hardcover</format>
  <publisher>Knopf Alfred A</publisher>
  <time>August 2002</time>
  <price>$14.70</price>
  <save>30%</save>
</book>
<book> … </book>
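To make the HTML-to-XML mapping concrete, here is a hand-written wrapper sketch using only the Python standard library. It is not code generated by XWRAP Elite: the class `BookWrapper`, the input page layout (`li class="book"` with `span` fields), and the reduced set of fields are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class BookWrapper(HTMLParser):
    """Maps <span class="..."> text in a hypothetical result page to book fields."""
    FIELDS = {"title", "author", "price"}

    def __init__(self):
        super().__init__()
        self.books = []      # one dict per extracted book object
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "book":
            self.books.append({})          # object boundary
        elif tag == "span" and cls in self.FIELDS and self.books:
            self._field = cls              # element boundary

    def handle_data(self, data):
        if self._field:
            self.books[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def to_xml(books):
    """Tag each extracted field as an XML element under <book>."""
    root = ET.Element("books")
    for book in books:
        elem = ET.SubElement(root, "book")
        for name, value in book.items():
            ET.SubElement(elem, name).text = value
    return ET.tostring(root, encoding="unicode")


page = ('<ul><li class="book"><span class="title">After the Quake</span>'
        '<span class="author">Haruki Murakami</span>'
        '<span class="price">$14.70</span></li></ul>')
wrapper = BookWrapper()
wrapper.feed(page)
xml_out = to_xml(wrapper.books)
```

A hand-written wrapper like this has to be revised whenever the page layout changes, which is the maintenance cost that wrapper generators aim to reduce.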
XWRAPElite Architecture
[Figure: pipeline labels are Doc., Subtree Extraction, Subtree, Object Separation, Object Extraction, Object Pruning, Objects, Element Extraction, Element Alignment, Elements, Tagging, Output Tagging, XML. Legend: Automated Process, Human Input.]
An Example Usage of XWrap Wrappers
‹ Search query
● comparing the prices of all the books on JDBC
‹ Sentinel
● notify me whenever there is a new book coming out on Java Threads
Query Planning and Execution
Query Results
Applications of XWrap Wrappers
‹ The Continual Queries Project
● Wrappers are used for
■ intelligent mediation of information from multiple heterogeneous data sources
■ creating and maintaining source content and capability profiles
■ supporting query routing and other query optimizations
■ constructing change detectors for Web information sources
● An Example
■ Book Shopping and Price Comparison/Tracking Agent
URL
‹ The XWRAPElite system
http://disl.cc.gatech.edu//XWRAPElite/
Open Source downloadable
Questions?