Rapidly Constructing Integrated Applications from Online Sources
Craig A. Knoblock
Information Sciences Institute, University of Southern California

Motivating Example
[Figure: integrating posts from BiddingForTravel.com with data from Priceline, Orbitz, and a map service.]

Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans

Outline (next: Extracting data from unstructured and ungrammatical sources)

Ungrammatical & Unstructured Text
- For simplicity, we call these sources "posts"
- Goal: from a post, extract
    <price>$25</price>
    <hotelName>holiday inn sel.</hotelName>
    <hotelArea>univ. ctr.</hotelArea>
- Wrapper-based IE does not apply (e.g., Stalker, RoadRunner)
- NLP-based IE does not apply (e.g., Rapier)

Reference Sets
- IE infused with outside knowledge: "reference sets"
- Collections of known entities and their associated attributes
- Online (or offline) sets of documents, e.g., the CIA World Fact Book
- Online (or offline) databases, e.g., the Comics Price Guide, Edmunds

Algorithm Overview: Use of Reference Sets
- Post: "$25 winning bid at holiday inn sel. univ. ctr."
- Reference set: Ref_hotelName = {Holiday Inn Select, Hyatt Regency, ...}, Ref_hotelArea = {University Center, Downtown, ...}
- Record linkage matches the post against the reference record <Holiday Inn Select, University Center>
- Extraction then labels the post's tokens ("$25", "winning", "bid", ...) to produce:
    <price>$25</price>
    <hotelName>holiday inn sel.</hotelName>
    <hotelArea>univ. ctr.</hotelArea>
    <Ref_hotelName>Holiday Inn Select</Ref_hotelName>
    <Ref_hotelArea>University Center</Ref_hotelArea>

Our Record Linkage Problem
- Posts are not yet decomposed into attributes
- Posts contain extra tokens that match nothing in the reference set
- Post: "$25 winning bid at holiday inn sel. univ. ctr."
- Reference set:

    hotel name             hotel area
    Holiday Inn Greentree  ...
    Holiday Inn Select     University Center
    Hyatt Regency          Downtown

Our Record Linkage Solution
- P = "$25 winning bid at holiday inn sel. univ. ctr."
- Compute a record-level similarity plus field-level similarities, e.g., for the record <Hyatt Regency, Downtown>:

    V_RL = < RL_scores(P, "Hyatt Regency Downtown"),
             RL_scores(P, "Hyatt Regency"),
             RL_scores(P, "Downtown") >

- Binary rescoring then selects the best-matching member of the reference set for the post (sketched in code after the results tables below)

Extraction Algorithm
- Post: "$25 winning bid at holiday inn sel. univ. ctr."
- For each token, generate

    V_IE = <common_scores(token), IE_scores(token, attr1), IE_scores(token, attr2), ...>

- A multiclass SVM labels each token, yielding price = "$25", hotel name = "holiday inn sel.", hotel area = "univ. ctr."
- Finally, clean each whole extracted attribute

Experimental Data Sets
- Domain: hotels
- Posts: 1,125 posts from www.biddingfortravel.com, covering Pittsburgh, Sacramento, and San Diego; attributes: star rating, hotel area, hotel name, price, date booked
- Reference set: 132 records drawn from special posts on the BFT site that list, per area, any hotels ever bid on in that area; attributes: star rating, hotel area, hotel name

Comparison to Existing Systems
- Record linkage: WHIRL, a record-linkage system that allows non-decomposed attributes
- Information extraction: Simple Tagger (a state-of-the-art CRF-based IE system) and Amilcare (NLP-based IE)

Record Linkage Results (10 trials, 30% train / 70% test)

    Hotel domain   Prec.   Recall  F-Measure
    Phoebus        93.60   91.79   92.68
    WHIRL          83.52   83.61   83.13
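To ground the record-linkage step, here is a minimal Python sketch of scoring a post against the reference set. The slides do not specify the string metrics behind RL_scores or the learned binary-rescoring model, so difflib's SequenceMatcher stands in for the similarity functions and an argmax over summed scores stands in for rescoring; the Greentree record's area is not shown above and is left blank.

```python
# A minimal sketch of reference-set record linkage, assuming
# SequenceMatcher as a stand-in for the real similarity metrics and an
# argmax as a stand-in for the learned binary rescoring step.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Character-level similarity, tolerant of abbreviations like 'sel.'."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rl_vector(post: str, record: dict) -> list[float]:
    """V_RL: one record-level score plus one field-level score per attribute."""
    record_level = sim(post, " ".join(record.values()))
    return [record_level] + [sim(post, value) for value in record.values()]

def best_match(post: str, reference_set: list[dict]) -> dict:
    """Select the best-matching reference record for the post."""
    return max(reference_set, key=lambda rec: sum(rl_vector(post, rec)))

reference_set = [
    {"hotelName": "Holiday Inn Greentree", "hotelArea": ""},  # area not shown
    {"hotelName": "Holiday Inn Select", "hotelArea": "University Center"},
    {"hotelName": "Hyatt Regency", "hotelArea": "Downtown"},
]
post = "$25 winning bid at holiday inn sel. univ. ctr."
print(best_match(post, reference_set)["hotelName"])  # Holiday Inn Select
```

The same vector-of-scores idea carries over to extraction: V_IE concatenates per-token scores against each attribute, and the multiclass SVM picks the attribute label.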
Token-Level Extraction Results: Hotel Domain

    Attribute (token freq.)  System         Prec.   Recall  F-Measure
    Area (809.7)             Phoebus        89.25   87.50   88.28
                             Simple Tagger  92.28   81.24   86.39
                             Amilcare       74.20   78.16   76.04
    Date (751.9)             Phoebus        87.45   90.62   88.99
                             Simple Tagger  70.23   81.58   75.47
                             Amilcare       93.27   81.74   86.94
    Name (1873.9)            Phoebus        94.23   91.85   93.02
                             Simple Tagger  93.28   93.82   93.54 *
                             Amilcare       83.61   90.49   86.90
    Price (850.1)            Phoebus        98.68   92.58   95.53
                             Simple Tagger  75.93   85.93   80.61
                             Amilcare       89.66   82.68   85.86
    Star (766.4)             Phoebus        97.94   96.61   97.84
                             Simple Tagger  97.16   97.52   97.34 *
                             Amilcare       96.50   92.26   94.27

    * Difference from Phoebus is not statistically significant.

Outline (next: Automatically discovering models of sources)

Discovering Models of Sources
- Source models are required for integration: a mediator provides uniform access to heterogeneous sources
- Source definitions are used to reformulate queries
- New service, no source model, no integration!
- Can we discover source models automatically?
- Example: a mediator with source definitions for United, Lufthansa, and Qantas receives the query

    SELECT MIN(price) FROM flight WHERE depart="MXP" AND arrive="PIT"

  and reformulates it into web service calls such as calcPrice("MXP", "PIT", "economy"). When Alitalia appears as a new service, the mediator has no definition for it.

Inducing Source Definitions: A Simple Example (Steps 1-2)
- Step 1: use metadata to classify the input types
- Step 2: invoke the service and classify the output types
- Known source, already registered with the mediator:

    LatestRates($country1, $country2, rate) :- exchange(country1, country2, rate)

- Semantic types: currency {USD, EUR, AUD}; rate {1936.2, 1.3058, 0.53177}
- Domain predicates: exchange(currency, currency, rate)
- New source:

    RateFinder($fromCountry, $toCountry, val) :- ?

  Sample invocations return {<EUR, USD, 1.30799>, <USD, EUR, 0.764526>, ...}

Inducing Source Definitions: A Simple Example (Steps 3-5)
- Step 3: generate plausible source definitions:

    def_1($from, $to, val) :- exchange(from, to, val)
    def_2($from, $to, val) :- exchange(to, from, val)

- Step 4: reformulate them in terms of other sources:

    def_1($from, $to, val) :- LatestRates(from, to, val)
    def_2($from, $to, val) :- LatestRates(to, from, val)

- Step 5: invoke the service and compare the outputs:

    Input       RateFinder  def_1     def_2
    <EUR, USD>  1.30799     1.30772   0.764692
    <USD, EUR>  0.764526    0.764692  1.30772
    <EUR, AUD>  1.68665     1.68979   0.591789

  def_1 matches the observed outputs, so it is selected.

The Framework
- Intuition: services often have similar semantics, so we should be able to use what we know to induce that which we don't
- Two-phase algorithm; for each operation provided by the new service:
  1. Classify its input/output data types
     - Classify inputs based on metadata similarity
     - Invoke the operation and classify outputs based on the returned data
  2. Induce a source definition
     - Generate candidates via Inductive Logic Programming
     - Test individual candidates by reformulating them (see the sketch below)
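To illustrate the candidate-testing step (steps 4-5 of the currency example), the sketch below invokes the new source on sample inputs and discards any candidate whose reformulated prediction disagrees with the observed output. The rate values are the illustrative numbers from the example above, latest_rates is a stub for the known LatestRates source, and the 1% tolerance is an assumption.

```python
# A minimal sketch of testing candidate definitions by reformulation:
# predict each output via the known source, invoke the new source, and
# keep only the candidates that agree within a tolerance.

KNOWN_RATES = {("EUR", "USD"): 1.30772, ("USD", "EUR"): 0.764692,
               ("EUR", "AUD"): 1.68979, ("AUD", "EUR"): 0.591789}

def latest_rates(frm: str, to: str) -> float:
    """Stub for the known LatestRates source."""
    return KNOWN_RATES[(frm, to)]

# Candidate definitions for RateFinder, reformulated through LatestRates:
# def_1 keeps the argument order, def_2 swaps it.
CANDIDATES = {
    "def_1": lambda frm, to: latest_rates(frm, to),
    "def_2": lambda frm, to: latest_rates(to, frm),
}

def test_candidates(new_source, inputs, tol=0.01):
    """Invoke the new source and keep the candidates whose predictions agree."""
    surviving = set(CANDIDATES)
    for frm, to in inputs:
        observed = new_source(frm, to)
        for name in list(surviving):
            if abs(CANDIDATES[name](frm, to) - observed) / observed > tol:
                surviving.discard(name)
    return surviving

# Observed outputs from the sample invocations of RateFinder above:
OBSERVED = {("EUR", "USD"): 1.30799, ("USD", "EUR"): 0.764526,
            ("EUR", "AUD"): 1.68665}

print(test_candidates(lambda frm, to: OBSERVED[(frm, to)], list(OBSERVED)))
# {'def_1'}
```

The tolerance matters because rates drift between invocations of the two services; requiring exact equality would reject the correct definition.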
Use Case: Zip Code Data
- A single real zip-code service with multiple operations
- The first operation is already defined:

    getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
        centroid(zip1, lat1, long1),
        centroid(zip2, lat2, long2),
        distanceInMiles(lat1, long1, lat2, long2, distance).

- The goal is to induce a definition for a second operation:

    getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
        centroid(zip1, lat1, long1),
        centroid(zip2, lat2, long2),
        distanceInMiles(lat1, long1, lat2, long2, distance2),
        (distance2 ≤ distance1), (distance1 ≤ 300).

- Same service, so there is no need to classify inputs/outputs or match constants!

Generating Definitions: ILP
- We want to induce a source definition for getZipCodesWithin($zip1, $distance1, zip2, distance2)
- Predicates available for generating definitions: {centroid, distanceInMiles, ≤, =}
- Use the known definition as the starting point for local search, since the new operation's type signature contains that of the known source:

    getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
        centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
        distanceInMiles(lat1, long1, lat2, long2, distance).

- Plausible source definitions (abbreviating centroid as cen and distanceInMiles as dIM; #d denotes a learned numeric constant):

    1. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 = d1)    [contained in defs 2 and 4]
    2. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 ≤ d1)
    3. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1)
    4. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ d2)
    5. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1), (d1 ≤ #d)
    6. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d1 ≤ #d)    [INVALID: d2 unbound!]
    ...
    n. cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (lt1 ≤ d1)   [UNCHECKABLE: lt1 inaccessible!]

Preliminary Results
- Settings: 6 zip-code constants initially available; 20 samples performed per trial; 5 candidate definitions in the search space
- Result: converged on an "almost correct" definition in 12 iterations:

    getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
        centroid(zip1, lat1, long1),
        centroid(zip2, lat2, long2),
        distanceInMiles(lat1, long1, lat2, long2, distance2),
        (distance2 ≤ distance1), (distance1 ≤ 243).

- The learned bound of 243 only approximates the service's true limit of 300, presumably because the sampled positive examples can only bound the constant from below (see the sketch below)
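The sketch below shows, with made-up data, how candidate bodies can be checked against sampled output tuples of getZipCodesWithin, and why the threshold constant is only approximated. Here centroid_distance is a hypothetical helper standing in for the centroid/distanceInMiles chain; the real system instead reformulates candidates through known operations such as getDistanceBetweenZipCodes.

```python
# A minimal sketch: test candidate bodies against sampled
# (zip1, distance1, zip2, distance2) tuples of getZipCodesWithin.
# centroid_distance and the sample tuples are illustrative stand-ins.

def centroid_distance(z1: str, z2: str) -> float:
    table = {("90089", "90007"): 1.1, ("90089", "90210"): 9.8}
    return table.get((z1, z2)) or table[(z2, z1)]

# Candidates 3 and 4 from the list above, as tests over one sample tuple.
CANDIDATES = {
    "3: dIM binds d2, (d2 <= d1)":
        lambda z1, d1, z2, d2: centroid_distance(z1, z2) == d2 and d2 <= d1,
    "4: dIM binds d2, (d1 <= d2)":
        lambda z1, d1, z2, d2: centroid_distance(z1, z2) == d2 and d1 <= d2,
}

SAMPLES = [("90089", 10.0, "90007", 1.1), ("90089", 10.0, "90210", 9.8)]

for name, body in CANDIDATES.items():
    verdict = "consistent" if all(body(*s) for s in SAMPLES) else "ruled out"
    print(f"candidate {name}: {verdict}")
# candidate 3 survives; candidate 4 is ruled out by the first sample

# The constant #d in candidate 5 is only bounded from below by the largest
# distance1 seen in a positive sample, which is one way a learner can
# settle on 243 rather than the service's true limit of 300.
print("tightest #d so far:", max(d1 for _, d1, _, _ in SAMPLES))
```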
Related Work
- Classifying web services (Hess & Kushmerick 2003; Johnston & Kushmerick 2004): classifies inputs, outputs, and services using metadata and data; we additionally learn the semantic relationships between inputs and outputs
- Category translation (Perkowitz & Etzioni 1995): learns functions describing operations available on the internet; we concentrate on a relational modeling of services
- CLIO (Yan et al. 2001): helps users define complex mappings between schemas, but does not automate the discovery of those mappings
- iMAP (Dhamankar et al. 2004): automates the discovery of certain complex mappings; our approach is more general (ILP) and tailored to web sources, where we must deal with the problem of generating valid input tuples

Outline (next: Dynamically building integration plans)

Dynamically Building Integration Plans
- Traditional data integration: a mediator answers a query such as "find information about all proteins that participate in the transcription process" over sources such as (1) SwissProt (e.g., protein P36246) and (2) GenBank (e.g., entry AAS60665.1)
- Problem solved here: create a web service that accepts the name of a biological process, <bname>, and returns information about the proteins that participate in it; the mediator composes this new web service from existing ones

Problem Statement (Cont'd)
- Assumption: information-producing web service operations
- Applicability: biological data web services; geospatial services (WMS, WFS); other applications that do not focus on transactions

Query-Based Web Service Composition
- View web service operations as source relations with binding restrictions
- Create a domain ontology
- Describe source relations in terms of domain relations; this can be inferred from WSDL
- Combine the Global-as-View and Local-as-View approaches
- Use a data integration system to answer user queries

Template-Based Web Service Composition
- Our goal is to compose new web services, so we must answer template queries, not specific queries
- Template-based approach: generate plans that take general parameter values into account
- It is easy to generate a universal plan [Schoppers], i.e., a plan that answers the template query as opposed to a specific query, but such plans can be very inefficient
- We therefore need to generate optimized "universal integration plans"

Example Scenario
- Protein sources:

    HSProtein($id, name, location, function, seq, pubmedid)
    MMProtein($id, name, location, function, seq, pubmedid)
    TransducerProtein($id, name, location, taxonid, seq, pubmedid)
    MembraneProtein($id, name, location, taxonid, seq, pubmedid)
    DipProtein($id, name, location, taxonid, function)

- Protein-protein interaction sources:

    MMProteinInteractions($fromid, toid, source, verified)
    HSProteinInteractions($fromid, toid, source, verified)

Example Rules and Query

    ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
        HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)

    ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
        MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)

    ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
        ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
        ProteinProteinInteractions(itoid, toid, taxonid, source, verified)

    Q(fromid, toid, taxonid, source, verified) :-
        (fromid = !fromid), (taxonid = !taxonid),
        ProteinProteinInteractions(fromid, toid, taxonid, source, verified)

Unoptimized Plan vs. Optimized Plan
- [Figures: the unoptimized and optimized integration plans.]
- The optimized plan exploits constraints in the source descriptions to filter the queries sent to the sources

Example Scenario (Cont'd)

    Q1(fromid, fromname, fromseq, frompubid, toid, toname, toseq, topubid) :-
        (fromid = !fromproteinid),
        Protein(fromid, fromname, loc1, f1, fromseq, frompubid, taxonid1),
        ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
        Protein(toid, toname, loc2, f2, toseq, topubid, taxonid2)

[Figure: the composed plan takes fromproteinid as input, retrieves the source protein (yielding fromseq), retrieves its protein-protein interactions (yielding toproteinid), joins, retrieves each target protein (yielding toseq), and outputs <fromproteinid, fromseq, toproteinid, toseq>.]

Example Integration Plan
[Figure: the integration plan generated for Q1.]

Adding Sensing Operations for Tuple-Level Filtering
- Compute the original plan for the template query
- For each constraint on the sources:
  - Introduce the constraint into the query
  - Rerun the inverse-rules algorithm
  - Compare the cost of the new plan to that of the original plan
  - Save the plan with the lowest cost
- The result is an optimized universal integration plan

Outline (next: Efficiently executing the integration plans)

Dataflow-Style, Streaming Execution
- Map the datalog plans into a streaming, dataflow execution system (e.g., a network query engine)
- We use the Theseus execution system, since it supports recursion
- Key challenges:
  - Mapping non-recursive plans
  - Mapping recursive plans: data processing, loop detection, query-results update, termination checking, and recursive callbacks

Example Translation
- Input: the ProteinProteinInteractions rules above, together with the template query

    Q(fromid, toid, taxonid, source, verified) :-
        ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
        (fromid = !fromproteinid), (taxonid = !taxonid)

- The non-recursive rules map directly onto dataflow operations; the recursive rule requires the loop-detection and termination machinery above (a sketch follows)
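A minimal sketch, not the actual Theseus plan, of how the recursive rule can be evaluated in a streaming fashion once fromid is bound: new ids flow back into the plan (the recursive callback), a seen-set provides loop detection, and execution terminates when an iteration produces no new tuples. fetch_interactions is a hypothetical stub for the HS/MM interaction services, with a deliberate cycle in its data.

```python
# A minimal sketch of streaming evaluation of the recursive
# ProteinProteinInteractions rule with fromid bound to !fromproteinid.
# fetch_interactions is a hypothetical stub for the interaction sources;
# its data contains the cycle P1 -> P2 -> P3 -> P1.

def fetch_interactions(fromid: str, taxonid: int) -> list[str]:
    table = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
    return table.get(fromid, [])

def transitive_interactions(fromid: str, taxonid: int) -> set[tuple]:
    reached: set[str] = set()
    frontier = {fromid}
    while frontier:                          # termination check: frontier empties
        next_frontier = set()
        for pid in frontier:
            for toid in fetch_interactions(pid, taxonid):
                if toid not in reached:      # loop detection / results update
                    reached.add(toid)
                    next_frontier.add(toid)  # recursive callback on new ids only
        frontier = next_frontier
    return {(fromid, toid, taxonid) for toid in reached}

print(sorted(transitive_interactions("P1", 9606)))
# [('P1', 'P1', 9606), ('P1', 'P2', 9606), ('P1', 'P3', 9606)]
```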
Example Theseus Plan
[Figure: the Theseus dataflow plan generated for the rules and query above.]

Bioinformatics Domain Results
- Experiments in a bioinformatics domain with 60 real web services provided by NCI
- We varied the number of domain relations in a query from 1 to 30 and report composition time together with execution time

[Chart: composition time and execution time in milliseconds (up to roughly 16,000) versus the number of relations in the query (1-8); both grow with query size.]

- Tuple-level filtering can improve the execution time of the generated integration plan by up to 53.8%
- For complex plans, Theseus can improve the execution time of the generated web service by up to 33.6%

Discussion
- A huge number of sources is available online, and we need tools and systems that support their dynamic integration
- This talk described techniques for:
  - Extracting data from unstructured and ungrammatical sources
  - Discovering the models of online sources required for integration
  - Dynamically and efficiently building integration plans over web sources
  - Efficiently executing the integration plans
- Much work still remains to be done...

More Information
http://www.isi.edu/~knoblock

- Matthew Michelson and Craig A. Knoblock. Semantic annotation of unstructured and ungrammatical text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.
- Mark James Carman and Craig A. Knoblock. Inducing source descriptions for automated web service composition. In Proceedings of the AAAI 2005 Workshop on Exploring Planning and Scheduling for Web Services, Grid, and Autonomic Computing, 2005.
- Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, optimizing, and executing plans for bioinformatics web services. VLDB Journal, Special Issue on Data Management, Analysis and Mining for Life Sciences, 14(3):330-353, September 2005.