s 4440 C Lecture 1: Tracking Changes on the Web
Fresh Information Delivery with Continual Queries
Ling Liu
Distributed Data Intensive Systems Lab
College of Computing, Georgia Institute of Technology
© Ling Liu

Outline
Motivation
Part I: Continual Query Concept and Continual Query Project
Part II: CQ-Related Research Issues
Our Approach and Initial Evaluation
WebCQ Demo
XWRAP Toolkits
Ideas for Course Projects

Motivation
Everyone today can publish information on the web independently and at any time
Information sources are constantly and rapidly changing
These rapid and often unpredictable changes create a new problem:
  ● detecting, representing, and notifying changes
Personalized Information Monitoring

Applications and Motivation
User requirement: Monitor the weather in the region of the port of Savannah and Atlanta over the next 3 months. Alert when the weather condition is bad (heavy wind or heavy rain).
[Diagram labels: User Requirement; Information Monitoring Service; Pushed Updates or Trigger re-planning; Transportation Route Information; Weather Information; Truck Avail. Information]

Information Monitoring at Internet Scale: What are the main challenges?
Information sources are heterogeneous and autonomous (non-cooperating)
There is high latency between source changes and change notification
Techniques are needed for efficient execution of long-running distributed queries and long-standing distributed triggers

Personalized Info Monitoring: State of the Art
  ● Manual process
    ■ locating and detecting information updates manually
    ■ High latency, uncontrollable
  ● Application-specific programmed polling
    ■ Scalability difficulties
    ■ Extensibility and generality difficulties

Continual Query Project
Goal: Internet-scale Solution
  ● delivering the right info to the right users at the right time
  ● scalability, extensibility, and responsiveness
Methods and Key Techniques:
    ■ Continual query concept
    ■ An extensible three-tier architecture
    ■ Mechanisms for efficient and scalable implementation of CQs
    ■ Two Running Systems
  ● OpenCQ: Monitoring changes in structured or semi-structured data sources
  ● WebCQ: Monitoring changes in arbitrary web pages

CQ Project Overview
[Diagram: Monitoring, Filtering, Notification by CQ. Labels: Internet; CQ Engines; Distr. Trig.; Query Results; Event Obs.; Filter; Sensors; Significant Changes; Wireless]

Continual Query Concept
Continual Query: {Q, Trigger, STOP}
Continual Semantics
  ● A CQ is issued once and runs until STOP
  ● When Trigger becomes true, Q is evaluated
  ● New results of Q (since the previous execution) will be returned

CQ Concept: An Example
Continual Query: {Q, Trigger, STOP}
  ● Q: Alternative transport routes
  ● Trigger: heavy wind or rain
  ● STOP: three months
[Diagram labels: Transportation Route Information; Weather Information; Truck Information; Install CQ; CQ Server; CQ updates]
Monitor the weather in the region of the port of Savannah and Atlanta. Find all alternate transport routes when the weather condition is bad (heavy wind or heavy rain).
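The {Q, Trigger, STOP} triple from the example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the OpenCQ or WebCQ implementation: the `ContinualQuery` class, the `run` helper, the snapshot layout, and the route names are all assumptions invented here, and STOP is modelled as a function of the poll count rather than of wall-clock time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Snapshot = Dict[str, object]  # hypothetical: one observation of the monitored sources


@dataclass
class ContinualQuery:
    query: Callable[[Snapshot], FrozenSet[str]]  # Q
    trigger: Callable[[Snapshot], bool]          # Trigger
    stop: Callable[[int], bool]                  # STOP (assumed: depends on poll count)


def run(cq: ContinualQuery, snapshots: List[Snapshot]) -> List[List[str]]:
    """Poll snapshots in order; return one batch of new results per trigger firing."""
    seen = set()
    batches = []
    for tick, snap in enumerate(snapshots):
        if cq.stop(tick):
            break
        if not cq.trigger(snap):
            continue  # trigger is false: nothing to evaluate at this poll
        new = set(cq.query(snap)) - seen  # only results not returned before
        seen |= new
        batches.append(sorted(new))
    return batches


# Invented data for illustration only.
snapshots = [
    {"weather": "clear", "routes": {"route-A"}},
    {"weather": "heavy rain", "routes": {"route-A", "route-B"}},
    {"weather": "heavy wind", "routes": {"route-A", "route-B", "route-C"}},
    {"weather": "heavy rain", "routes": {"route-A"}},
]
cq = ContinualQuery(
    query=lambda s: frozenset(s["routes"]),
    trigger=lambda s: s["weather"] in {"heavy wind", "heavy rain"},
    stop=lambda tick: tick >= 3,
)
batches = run(cq, snapshots)
print(batches)  # [['route-A', 'route-B'], ['route-C']]
```

The query is installed once and then evaluated only when the trigger fires; each firing returns just the results that were not returned before, and the fourth snapshot is never examined because STOP has become true.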
CQ Triggers
Time-based Triggers:
  ● based on time events
    ■ every day at 10am, on the first day of each month
  ● Implementation feature: User-specified polling interval
Content-based Triggers:
  ● based on content update events
    ■ whenever the snow coverage at Rochers De Naye reaches 10 feet, send me the up-to-date train schedule
  ● Implementation feature: System-controlled polling interval

Example Continual Queries
Report to me the total amount and classification of materials coming into or going out from these ports every 5 hours.
Notify me in the next six months whenever the inventory level of 120 mm ammunition drops by 5%.
Notify me whenever an airplane has been in sector A for more than 5 minutes.
Report to me every day at 10:00 am if the demand for any stocked item published on this web site is higher than the planned inventory.

Part 2: Research Issues
General research issues for Querying and Search on the Web
CQ-related research issues

Research Issues in WebDB
architecture for interoperable and scalable global information systems
    ■ Client and Server or Peer-to-Peer
    ■ mediator-wrapper or multi-agent architecture
distributed query routing
distributed catalog management
distributed multi-layered indexing techniques
distributed query optimization
distributed query result assembly
…...
Incorporating Runtime Info is critical in addressing all these issues

CQ-related Research Issues
  ● Distributed event-driven architecture [Liu+TKDE97]
    ■ five models: objects, events, observation, notification, resources
    ■ Client-Server or P2P
  ● Performance
    ■ efficient execution of CQs
  ● Coverage
    ■ types of changes that can be captured
  ● Reliability
    ■ CQ System Recovery (server end) and Application Recovery (client end)
  ● Scalability
    ■ number of data sources (e.g., 1000) and number of users (e.g., 10,000)

Research Problems
Efficient Execution of Continual Queries
  ● Change Detection Problems
    ■ No explicit support for synchronous triggers
    ■ Update events (operations) occur autonomously
    ■ No built-in triggers at the data source sites (few data producers publish the trigger facility or the native data updates)
  ● Differential evaluation of Continual Queries
    ■ Naïve vs. DRA algorithms
Scalable Distributed Trigger Processing
  ● Tens or hundreds of thousands of triggers firing at thousands of data sources.

Efficient Execution of CQs
A model for efficient execution of CQs
  ● brute-force (naive) algorithm
  ● differential re-evaluation algorithm (DRA)
A model for efficient detection of simple and composite change events
  ● Primitive and Composite Event Handling (specification, detection, notification)

Continual Semantics Revisited
Continual Query: {Q, Trigger, STOP}
Continual Semantics: the result of a continual query is the set of data that would be returned if the query were executed at every instant in time.
Qcq(t) = ∪_{x ≤ t} Q(x)
    ■ Qcq(t): the total set of data returned up to time t by executing Q as a continual query
    ■ Q(t): the result of running Q at time t
When a query Q is executed with continual semantics, it returns Qcq(t), not Q(t).
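A small worked example may help separate Qcq(t) from Q(t). The sketch below assumes that "every instant" is approximated by a finite set of polling times, and the `q_at` table and the function name `continual_result` are invented for illustration.

```python
def continual_result(q_at, t):
    """Qcq(t): the union of Q(x) over all observed instants x <= t."""
    result = set()
    for x, rows in q_at.items():
        if x <= t:
            result |= rows
    return result


# Hypothetical results of running Q at polling times 1, 2 and 3.
q_at = {1: {"a"}, 2: {"a", "b"}, 3: {"c"}}

print(sorted(continual_result(q_at, 3)))  # ['a', 'b', 'c']  (Qcq(3))
print(sorted(q_at[3]))                    # ['c']            (Q(3))
```

At time 3 the plain query sees only `c`, while the continual query has delivered `a`, `b`, and `c` over its lifetime.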
Efficient Execution of CQs
A model for efficient execution of CQs
  ● brute-force (naive) algorithm
  ● differential re-evaluation algorithm (DRA)

CQ Execution: Naive Algorithm
set t := −∞, set Q(t) := ∅
while stop <> true do
    set tprev := t, set t := current time
    if Q(tprev) = ∅ then            // first run
        execute query Qcq(t), display Qcq(t)
    else if trig = false then
        sleep
    else
        execute queries Qcq(t) and Qcq(tprev)
        return Qcq(t) − Qcq(tprev)

Incremental Query Evaluation
Problem with the Naive Approach
  ● when answer(Q) involves a large collection of data sources and the update between tprev and t is relatively small, the naive approach is inefficient
  ● Example: Q := R × S and there is one update during the period (tprev, t): insert(e, R).
    ■ Computing {e} × S is much cheaper than re-evaluating Q, which recomputes the whole of R × S
    ■ We call {e} × S an incremental query, denoted as ∆Q(tprev, t)

CQ Execution: DRA Algorithm [Liu+ICDCS96]
set t := −∞, set Q(t) := ∅
while stop <> true do
    set tprev := t, set t := current time
    if Q(tprev) = ∅ then            // first run
        execute query Qcq(t), display Qcq(t)
    else if trig = false then
        sleep
    else
        execute query ∆Q(tprev, t)
        return Qcq(t) − Qcq(tprev), the diff result, to the user

Research Challenges
Brute force vs. DRA:
    ■ Algorithms for effectively transforming arbitrary CQs into delta CQs [ICDCS96]
    ■ When is the DRA beneficial, for which types of data sources, and for what types of CQs?
    ■ Algorithms for efficient caching of previous CQ execution results
    ■ Techniques for efficient/scalable trigger condition evaluation
        Tcq1 = (E1, E2, E3)
        Tcq2 = (E1 | E2 | E3)
        Tcq3 = (E1, E5)
        ...
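The Q := R × S example from the Incremental Query Evaluation slide can be checked with a short sketch. It is not the DRA algorithm of [Liu+ICDCS96]; it only illustrates the special case on the slide under stated assumptions: the only updates are inserts into R, the inserted tuples are not already in R, and S is unchanged. The relation contents and the function names `full_eval` and `delta_eval` are invented.

```python
from itertools import product


def full_eval(R, S):
    """Naive evaluation: recompute the whole cross product R x S."""
    return set(product(R, S))


def delta_eval(inserted_R, S):
    """Incremental query for inserts into R only: {e} x S for each inserted e."""
    return set(product(inserted_R, S))


# Hypothetical relations; "e" is the single tuple inserted during (tprev, t).
R_prev = {"r1", "r2", "r3"}
S = {"s1", "s2"}
inserted = {"e"}
R_now = R_prev | inserted

# Naive approach: evaluate Q at both times and subtract.
naive_diff = full_eval(R_now, S) - full_eval(R_prev, S)
# Differential approach: evaluate only the incremental query.
dra_diff = delta_eval(inserted, S)

print(naive_diff == dra_diff)  # True
print(len(full_eval(R_now, S)) + len(full_eval(R_prev, S)), "pairs vs", len(dra_diff))
# 14 pairs vs 2
```

Both approaches return the same difference, but the naive one materializes 14 pairs to find it while the incremental query produces only the 2 new ones. Deletions and modifications would need additional handling that this sketch does not attempt.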
Efficient Event Detection
Two classes of event detection methods
  ● Synchronous approach
    ■ an event occurrence is communicated explicitly to and in synchronization with the event observer
    ■ Typical example: DB built-in triggers
  ● Asynchronous (Polling) approach
    ■ the server periodically checks for the occurrence of an event
    ■ All third-party monitoring services are of this type

Change Detection: Polling Approach
Problem Statement: Given two snapshots generated at two different polling time points, find the difference between the two snapshots. That is, compare the two snapshots and discover what has been inserted, what has been modified, what has been removed, etc.
Difference Algorithms
    ■ GNU diff utility [HHS+-MIT]
    ■ ediff program [Kifer95] (Emacs), etc.
    ■ LaDiff program [CRHW96-stanford]

Polling Approach: Basic Concepts
Represent each snapshot using a generic data structure
    ■ such as an ordered tree
    ■ good for documents, HTML pages, LaTeX files
Define a set of change operations for capturing update types
Define an Edit Script ε = a sequence of update operations
    ■ Primitive change operations include INS(node), DEL(node), UPD(node), COPY(node), MOVE(node), etc.

Polling Approach: Formal Model
Problem Statement: Given two rooted, labeled trees T1 and T2, find the edit script ε of the lowest cost, that is, ε transforms T1 to a tree that is isomorphic to T2 and, for any edit script ε′ that also does so for (T1, T2), the following property holds: Cost(ε) ≤ Cost(ε′).

Polling Approach: Optimization
Known Problem: A typical difference algorithm over two trees of n nodes runs in time O(n² log² n) for balanced trees and even higher for unbalanced trees [ZhangShasha89]
There are several ways to improve the diff performance:
  ● use a domain-specific language to capture the portions of the tree that users are interested in monitoring for changes
    ■ by coloring those portions of the tree that need to be continually watched for updates
    ■ thus reducing the problem to a diff between two colored trees whose number of colored nodes m is less than n

Research Problems
Efficient Execution of Continual Queries
  ● Change Detection Problems
  ● Differential evaluation of Continual Queries
Scalable Distributed Trigger Processing
  ● Hundreds of thousands of triggers firing at thousands of data sources.

Scalability
What architecture will allow the CQ system
  ● to efficiently organize and partition its change detection task,
  ● to handle notification to multiple applications or users interested in the same event(s),
  ● to characterize events involving multiple and possibly heterogeneous components (data sources),
  ● ultimately, to provide robust and painless support for more than 10,000 users over more than 1000 data sources

Initial Results
Using subscription grouping/indexing techniques
  ● to group CQs that have the same or similar trigger structure (trigger pattern) into one group,
  ● create a polling query for each group of CQs,
  ● using main-memory and disk-based organization for CQ indexes.
  ● Support grouping at trigger level, query level, notification level, and data level

Benefits of CQ Grouping
[Chart: Trigger Evaluation Time (Sec), axis from 0 to 3600, for Ncq=8,000 and Ncq=10,000; series: #ofGroups=100, #ofGroups=1000, #ofGroups=2000, No Grouping]

WebCQ Architecture

WebCQ Live Demo
http://disl.cc.gatech.edu/WebCQ

Demo Walkthrough

WebCQ for Mobile Clients
[Diagram labels: Mobile Client; Request for registration, installation of sentinel or updates; Content adapted update for clients; Mobile Adaptor; Profile DB; Request forwarding; Monitoring updates; WebCQ Server; Metadata Repository]

WebCQ for Wireless Palm
Scenario 1: Palm Query Application
Scenario 2: Java MIDlet Client
[Screenshot labels: Entrance; Sentinel Installation; WebCQ Notification]

WebCQ for cell phones
[Screenshot labels: Client Login; Entrance; Choose category; Stock (IBM); Choose target source URL; Sentinel Installation; Get updates]

Example WebCQ Applications
News, stock ticker
Traffic monitoring
Web site aggregation
Interest Recommendation
…
[Screenshot: Captured @ 3pm, 8/1/2002]

Ongoing Research
[Diagram labels: Sensors; Infotaps & Fat Clients; Cluster of Servers; Heterogeneous Information Sources]

P2P: Big Technical Challenges
Ability to efficiently distribute and partition services among (active) peers
  ● Dynamic load balancing problems
Ability to efficiently find files and services from a potentially huge number of peers
  ● data placement problems
[Gribble, Halevy, Ives, Rodrig, Suciu 01] [Clark 01]

Available Services & Tools
The OpenCQ system
http://www.cc.gatech.edu/projects/disl/CQ/
  ● The NT version is downloadable from http://www.cc.gatech.edu/projects/disl/CQ/plugin/
The WebCQ system
http://www.cc.gatech.edu/projects/disl/WebCQ/
Open Source download

Application Service Development
Mediator-Wrapper Technology (XWRAP toolkits)

Motivation: Why Wrapper Technology
Web: vast number of information sources
Search Engines: first and last resort
The Next Big Challenge - Interoperability
  ● Data Extraction and Data Interpretation
  ● Data Interoperation among applications
  ● Common approach: Using Wrappers
  ● Key challenges: Scalability and Evolution

Why Wrappers are useful
Wrappers hide the heterogeneity and enhance scalability in information integration systems
Junglee and Jango: initial success in industry
[Diagram: Mediators layered over Wrappers]

What is a Wrapper?
A wrapper is a software program designed for
  ● extracting and mapping the source information content into a more structured format (data wrapping);
  ● performing content filtering to answer content-sensitive queries over an individual web site (function wrapping).
[Diagram labels: NL query or XML-QL-like query; Structured data object(s); Software Wrapper; Search (query transformation); Keyword search; HTTP query; html/text; An individual Web Site; HTML document; Web document]

Design Choice of a Wrapper
Lightweight wrapper
  ● simple transformation of a mediator request to an executable method call to the remote web site
  ● Example: Oracle Wrapper/gateway
Heavyweight wrapper
  ● this type of wrapper is needed when the data manager at the remote data source site has less capability
  ● Example: Enhanced search tool for Amazon.com

Using Wrappers: Example Applications
Class 1 Applications
  ● offer an integrated search service over heterogeneous information providers
  ● Example: shopping comparison agent, Metacrawler
Class 2 Applications
  ● offer advanced aggregation/summarization service over a heterogeneous collection of web-based information providers
  ● Example: supply chain management, Aggregation Portal Service

Wrapper Construction
A main challenge in wrapper construction
  ● discover boundaries of meaningful objects in a web document or a collection of web documents
  ● distinguish the information content from its metadata description
  ● recognize and encode the metadata explicitly
[Diagram labels: HTML source document; Wrapper Developer’s Information Extraction Knowledge; OO representation; Relational representation; XML representation]

Example Application
Search books by author
[Diagram labels: Mediator; SQL-like query; Structural format, such as XML, relation table; Wrapper; <book name=“After the Quake”>…</book>; url; Web Pages]

XWRAP Family
XWRAP Original
  ● One of the first semi-structured Java wrapper generation systems with interactive GUI
  ● Generates Java wrappers in a couple of hours, compared to days and weeks by hand
  ● Published in SIGMOD 1999 (short paper), ICDE 2000, IJIS 2001
XWRAP Elite
  ● The first Web-based Wrapper Code Generator with automated information extraction capability
  ● Allows anyone to generate Java code on the fly in minutes
  ● 500+ users, 2000+ wrappers generated
  ● Published in SIGMOD 2000 (short paper), SIGMOD Record 2001; used in the OpenCQ system reported in IJCS 2001

XWRAP Family (cont.)
XWRAP Composer
  ● The first composable wrapper application generation system that supports multi-page information extraction
  ● Novel design framework
    ■ composer interface/outerface description
    ■ composer scripting language
        Specifying query-answer control logics
        Specifying information extraction logics
  ● Used for DoE SciDAC effort for Scientific workflow Process Applications
  ● Tested on five different Bioinformatic data sources
    ■ NCBI, GenBank, Clusfavor, PDB, Transfac

XWRAP Elite Approach
<book>
  <booklink>http://…</booklink>
  <title>After the Quake</title>
  <shipping>In Stock:Ships with 24</shipping>
  <author>Haruki Murakami, Jay Rubin(Translator)</author>
  <format>Hardcover</format>
  <publisher>Knopf Alfred A</publisher>
  <time>August 2002</time>
  <price>$14.70</price>
  <save>30%</save>
</book>
<book> … </book>

XWRAPElite Architecture
[Diagram labels: Doc.; Subtree Extraction; Subtree; Object Extraction; Object Separation; Object Pruning; Objects; Element Extraction; Element Alignment; Tagging; Elements; Output Tagging; XML; Automated Process; Human Input]

An Example Usage of XWrap Wrappers
Search query
  ● comparing the prices of all the books on JDBC
Sentinel
  ● notify me whenever there is a new book coming out on Java Threads

Query Planning and Execution

Query Results

Applications of XWrap Wrappers
The Continual Queries Project
  ● Wrappers are used for
    ■ intelligent mediation of information from multiple heterogeneous data sources
    ■ creating and maintaining source content and capability profiles
    ■ supporting query routing and other query optimizations
    ■ constructing change detectors for Web information sources
  ● An Example
    ■ Book Shopping and Price Comparison/Tracking Agent

URL
The XWRAPElite system
http://disl.cc.gatech.edu//XWRAPElite/
Open Source downloadable
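To make the data-wrapping idea concrete (HTML in, tagged XML objects out), here is a minimal hand-written sketch using only the Python standard library. It is not XWRAP-generated code and does not reproduce XWRAP's automated extraction: the page fragment, its `class` attribute names, the second book record, and the names `BookWrapper` and `wrap` are all assumptions made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the markup and the second record are invented.
SAMPLE_HTML = """
<div class="book"><span class="title">After the Quake</span>
<span class="author">Haruki Murakami</span><span class="price">$14.70</span></div>
<div class="book"><span class="title">Placeholder Title</span>
<span class="author">Placeholder Author</span><span class="price">$9.99</span></div>
"""

FIELDS = {"title", "author", "price"}


class BookWrapper(HTMLParser):
    """Collects one dict per <div class="book"> with the fields named in FIELDS."""

    def __init__(self):
        super().__init__()
        self.books = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "book":
            self.books.append({})      # object boundary: start a new record
        elif tag == "span" and cls in FIELDS and self.books:
            self._field = cls          # element boundary: remember which field

    def handle_data(self, data):
        if self._field is not None:
            self.books[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def wrap(html):
    """Return the extracted records as an XML string."""
    parser = BookWrapper()
    parser.feed(html)
    parser.close()
    root = ET.Element("books")
    for record in parser.books:
        book = ET.SubElement(root, "book")
        for name, value in record.items():
            ET.SubElement(book, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_out = wrap(SAMPLE_HTML)
print(xml_out)
```

A hand-coded wrapper like this breaks as soon as the site changes its markup, which is one reading of the "Scalability and Evolution" challenge and of why the slides emphasise generating wrappers rather than writing them by hand.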
Questions?