Crossing the Structure Chasm
Alon Halevy
University of Washington, Seattle
UCLA, April 15, 2004

The Structure Chasm
  Authoring: writing text vs. creating a schema
  Querying: keywords vs. using someone else's schema (but we can pose complex queries)
  Data sharing: easy vs. committees, standards

Why is This a Problem?
  Databases used to be isolated and administered only by experts.
  Today's applications call for large-scale data sharing:
    Big science (bio-medicine, astrophysics, …)
    Government agencies
    Large corporations
    The web (over 100,000 searchable data sources)
  The vision:
    Content authoring by anyone, anywhere
    Powerful database-style querying
    Use relevant data from anywhere to answer the query
    The Semantic Web
  Fundamental problem: reconciling different models of the world.

Outline
  Two motivating scenarios:
    A web of structured data
    Personal data management
  A tour of recent data sharing architectures:
    Data integration systems
    Peer-data management systems
  The algorithmic problems:
    Query reformulation
    Reconciling semantic heterogeneity
  Reconsidering authoring and querying challenges

Large-Scale Scientific Data Sharing
  [Figure: network of data sources labeled SwissProt, OMIM, UW, HUGO, UW Microbiology, UW Genome Sciences, UCLA Genetics, GeneClinics]

Non-urgent Applications
  [Figure labels: B of A, Fidelity, IRS, UW, 1040 DB, California IRS, NY IRS, County real-estate DB, Employer Tax Reports]

Personal Data Management [Semex: Sigurdsson, Nemes, H.]
  Data is organized by application.
  [Figure: association graph with labels Organizer, Participants, Event, Person, Homepage, Web Page, Cached, Document, Author, Softcopy, Sender, Recipients, Paper, Message, Softcopy, Presentation, Cites; data sources: Mail & calendar, Papers, HTML, Files, Presentations]

Finding Publications
  Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa
  Person: A. Halevy
  Person: Dan Suciu
  Person: Maya Rodrig
  Person: Steven Gribble
  Person: Zachary Ives

Following Associations (1)
  [Figure labels: Publication, Bernstein]

Following Associations (2)
  "A survey of approaches to automatic schema matching"
  "Corpus-based schema matching"
  "Database management for peer-to-peer computing: A vision"
  "Matching schemas by learning from others"
  [Figure labels: Publication, Bernstein]

Following Associations (3)
  [Figure labels: Cited by, Publication, Publication, Bernstein, Citations]

Following Associations (4)
  [Figure labels: Cited Authors, Publication, Bernstein]

PIM Data Sharing Challenges
  Need to combine data from multiple applications and sources.
  After an initial set of concepts is given:
    extend and personalize the concept hierarchy,
    share (parts of) our data with others,
    incorporate external data into our view.
  Also need instance-level reconciliation:
    Alon Halevy, A. Halevy, Alon Y. Levy are the same guy!

Outline
  Two motivating scenarios:
    A web of structured data
    Personal data management
  A tour of recent data sharing architectures:
    Data integration systems
    Peer-data management systems
  The algorithmic problems:
    Query reformulation
    Reconciling semantic heterogeneity
  Reconsidering authoring and querying challenges

Data Integration
  Goal: provide a uniform interface to a set of autonomous data sources.
  New abstraction layer over multiple sources.
  Many research projects (DB & AI):
    Mine: Information Manifold, Tukwila, BioMediator
    Cal: Garlic (IBM), Ariadne (USC), XMAS (UCSD), …
  Recent "Enterprise Information Integration" industry:
    Startups: Nimble, Enosys, Composite, MetaMatrix
    Products from big players: BEA, IBM

Relational Abstraction Layer
  Schema: the template for data.
  Students:
    SSN          Name     Category
    123-45-6789  Charles  undergrad
    234-56-7890  Dan      grad
    …            …        …
  Takes:
    SSN          CID
    123-45-6789  CSE444
    123-45-6789  CSE444
    234-56-7890  CSE142
    …            …
  Courses:
    CID     Name               Quarter
    CSE444  Databases          fall
    CSE541  Operating systems  winter
  Queries:
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid

Data Integration: Higher-level Abstraction
  [Figure: a query Q is posed over the Mediated Schema; semantic mappings connect the mediated schema to the sources, which receive queries Q1, Q2, Q3. Each source is drawn as a copy of the Students, Takes, and Courses tables.]

Mediated Schema (www.biomediator.org; Tarczy-Hornoch, Mork)
  [Figure: mediated schema concepts Entity, Phenotype, Gene, Sequenceable Entity, Protein, Structured Vocabulary, Experiment, Nucleotide Sequence, Microarray Experiment; sources OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, GEO]
  Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
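To make the mediated-schema idea concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is an illustration under loud assumptions, not how any of the systems named in the talk work: both "sources" live in one in-memory SQLite database, the mediated relation `enrolled` is expressed as an SQL view over them, and every table name, column name, and row other than the Charles/Dan/CSE444/CSE541 values is hypothetical. A query posed against the view is answered from both sources without the user knowing their layouts.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create two toy sources and a mediated view over them (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Source A: normalized layout modeled on the Students/Takes/Courses slide.
        CREATE TABLE a_students (ssn TEXT, name TEXT, category TEXT);
        CREATE TABLE a_takes (ssn TEXT, cid TEXT);
        CREATE TABLE a_courses (cid TEXT, name TEXT, quarter TEXT);

        -- Source B: a single denormalized table with different attribute names.
        CREATE TABLE b_enrollment (student_name TEXT, course_title TEXT);

        INSERT INTO a_students VALUES
            ('123-45-6789', 'Charles', 'undergrad'),
            ('234-56-7890', 'Dan', 'grad');
        INSERT INTO a_takes VALUES
            ('123-45-6789', 'CSE444'),
            ('234-56-7890', 'CSE541');
        INSERT INTO a_courses VALUES
            ('CSE444', 'Databases', 'fall'),
            ('CSE541', 'Operating systems', 'winter');
        -- Illustrative rows only; they do not come from the slides.
        INSERT INTO b_enrollment VALUES
            ('Mary', 'Databases'),
            ('Charles', 'Compilers');

        -- The semantic mapping: the mediated relation is defined over the sources.
        CREATE VIEW enrolled AS
            SELECT s.name AS student, c.name AS course
            FROM a_students s
            JOIN a_takes t ON s.ssn = t.ssn
            JOIN a_courses c ON t.cid = c.cid
            UNION
            SELECT student_name AS student, course_title AS course
            FROM b_enrollment;
        """
    )
    return conn


def courses_of(conn: sqlite3.Connection, student: str) -> list[str]:
    """Query the mediated relation; SQLite expands the view into source queries."""
    rows = conn.execute(
        "SELECT course FROM enrolled WHERE student = ? ORDER BY course",
        (student,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo()
    for who in ("Mary", "Charles", "Dan"):
        print(who, courses_of(demo, who))
```

Defining the mediated relation as a view keeps the mapping in one place. A real integration system would additionally have to cope with autonomous, remote sources and limited access patterns, which this sketch ignores.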
Semantic Mappings
  Differences in:
    Names in schema
    Attribute grouping
    Coverage of databases
    Granularity and format of attributes
  Inventory Database A:
    BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
  Inventory Database B:
    Books(Title, ISBN, Price, DiscountPrice, Edition)
    Authors(ISBN, FirstName, LastName)
    BookCategories(ISBN, Category)
    CDs(Album, ASIN, Price, DiscountPrice, Studio)
    CDCategories(ASIN, Category)
    Artists(ASIN, ArtistName, GroupName)

Key Issues
  Formalism for mappings
  Reformulation algorithms
  How will we create them?
  [Figure: a query Q over the Mediated Schema is reformulated into queries Q' over each source; each source is drawn as a copy of the Students, Takes, and Courses tables.]

Beyond Data Integration
  The mediated schema is a bottleneck for large-scale data sharing.
  It's hard to create, maintain, and agree upon.

Peer Data Management Systems
  Piazza: [Tatarinov, H., Ives, Suciu, Mork]
  Mappings specified locally
  Map to most convenient nodes
  Queries answered by traversing semantic paths.
  [Figure: peers UCLA, CiteSeer, Stanford, UW, DBLP, UC Berkeley, UCSD; a query Q is reformulated along the mappings as Q1 through Q6]

PDMS-Related Projects
  Hyperion (Toronto)
  PeerDB (Singapore)
  Local relational models (Trento)
  Edutella (Hannover, Germany)
  Semantic Gossiping (EPFL, Lausanne)
  Raccoon (UC Irvine)
  Orchestra (U. Penn)

A Few Comments about Commerce
  Until 5 years ago: Data integration = Data warehousing.
  Since then:
    A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
    Big guys made announcements (IBM, BEA).
    [Delay] Big guys released products.
  Success: analysts have a new buzzword, EII. New addition to acronym soup (with EAI).
  Lessons:
    Performance was fine.
    Need management tools.

Data Integration: Before
  [Figure: a query Q over the Mediated Schema is reformulated into a query Q' for each of five sources]

Data Integration: After
  [Figure: Nimble architecture. Front-End: User Applications, Lens™ File, InfoBrowser™, Software Developers Kit, Lens Builder™, NIMBLE™ APIs. Integration Layer: XML Query, Nimble Integration Engine™ (Compiler, Executor), Cache, Metadata Server, Common XML View. Management Tools: Integration Builder, Concordance Developer, Security Tools, Data Administrator. Sources: Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages, XML]

Sound Business Models
  [Chart: Enterprise Information, 1995 to 2005. Source: Gartner, 1999]
  Explosion of intranet and extranet information
  80% of corporate information is unmanaged
  By 2004, 30X more enterprise data than 1999
  The average company:
    maintains 49 distinct enterprise applications
    spends 35% of total IT budget on integration-related efforts

Outline
  Two motivating scenarios:
    A web of structured data
    Personal data management
  A tour of recent data sharing architectures:
    Data integration systems
    Peer-data management systems
  The algorithmic problems:
    Query reformulation
    Reconciling semantic heterogeneity
  Reconsidering authoring and querying challenges

Languages for Schema Mapping
  GAV, LAV, GLAV
  [Figure: a query Q over the Mediated Schema is reformulated into a query Q' for each of five sources; the mappings between the mediated schema and the sources are labeled GAV, LAV, and GLAV]

GLAV Mappings
  Mediated schema:
    Book: ISBN, Title, Genre, Year
    Author: ISBN, Name
  Mapping for sources R1a and R1b (books before 1970):
    R1a(isbn, title, n), R1b(isbn, genre, n) ⊆ Book(isbn, title, genre, year), Author(isbn, n), year < 1970
  [Figure: sources R1a, R1b, R2, R3, R4, R5 below the mediated schema; R1a and R1b are labeled "Books before 1970"]

Query Reformulation
  Query: Find authors of humor books
  R5(x, y) :- Book(x, y, "Humor", z)
  Plan: R1 Join R5
  [Figure: mediated schema Book(ISBN, Title, Genre, Year), Author(ISBN, Name) over sources R1 (Books before 1970), R2, R3, R4, R5 (Humor books)]

Answering Queries Using Views
  Formal problem: can we use previously answered queries to answer a new query?
  Challenge: need to invert query expression.
  Results depend on:
    Query language used for sources and queries
    Open-world vs. closed-world assumption
    Allowable access patterns to the sources
  MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
  Every commercial DBMS implements some version of answering queries using views.

Some Open Research Issues
  Managing large networks of mappings:
    Consistency
    Trust
  Improving networks: finding additional mappings
  Indexing: heterogeneous data across the network
  Caching: Where? What?
  [Figure: network of peers UCLA, CiteSeer, Stanford, UW, DBLP, UC Berkeley, UCSD]

Outline
  Two motivating scenarios:
    A web of structured data
    Personal data management
  A tour of recent data sharing architectures:
    Data integration systems
    Peer-data management systems
  The algorithmic problems:
    Query reformulation
    Reconciling semantic heterogeneity
  Reconsidering authoring and querying challenges

Semantic Mappings
  Need mappings in every data sharing architecture.
  "Standards are great, but there are too many."
  Inventory Database A:
    BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
  Inventory Database B:
    Books(Title, ISBN, Price, DiscountPrice, Edition)
    Authors(ISBN, FirstName, LastName)
    BookCategories(ISBN, Category)
    CDs(Album, ASIN, Price, DiscountPrice, Studio)
    CDCategories(ASIN, Category)
    Artists(ASIN, ArtistName, GroupName)

Why is it so Hard?
  Schemas never fully capture their intended meaning:
    We need to leverage any additional information we may have.
  A human will always be in the loop.
    Goal is to improve designer's productivity.
    Solution must be extensible.
  Two cases for schema matching:
    Find a map to a common mediated schema.
    Find a direct mapping between two schemas.

Typical Matching Heuristics
  We build a model for every element from multiple sources of evidence in the schemas:
    Schema element names
      BooksAndCDs/Categories ~ BookCategories/Category
    Descriptions and documentation
      ItemID: unique identifier for a book or a CD
      ISBN: unique identifier for any book
    Data types, data instances
      DateTime vs. Integer; addresses have similar formats
    Schema structure
      All books have similar attributes
  Models consider only the two schemas.
  In isolation, techniques are incomplete or brittle: need principled combination.

Using Past Experience
  Matching tasks are often repetitive.
  Humans improve over time at matching.
  A matching system should improve too!
  LSD: learns to recognize elements of the mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]
  Doan: 2003 ACM Doctoral Dissertation Award.
  [Figure: Mediated Schema above a set of data sources]

Example: Matching Real-Estate Sources
  Mediated schema: address, price, agent-phone, description
  Schema of realestate.com: location, listed-price, phone, comments
  Given mapping: location ~ address, listed-price ~ price, phone ~ agent-phone, comments ~ description
  realestate.com:
    location    listed-price  phone           comments
    Miami, FL   $250,000      (305) 729 0831  Fantastic house
    Boston, MA  $110,000      (617) 253 1429  Great location
    ...         ...           ...             ...
  homes.com:
    price     contact-phone   extra-info
    $550,000  (278) 345 7215  Beautiful yard
    $320,000  (617) 335 2315  Great beach
    ...       ...             ...
  Learned hypotheses:
    If "phone" occurs in the name => agent-phone
    If "fantastic" & "great" occur frequently in data values => description

Learning Source Descriptions
  We learn a classifier for each element of the mediated schema.
  Training examples are provided by the given mappings.
  Multi-strategy learning:
    Base learners: name, instance, description
    Combine using stacking.
  Accuracy of 70-90% in experiments.
  Learning about the mediated schema.

Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, H.]
  Can we use previous experience to match two new schemas?
  Learn about a domain?
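The multi-strategy idea from the LSD slides (several base learners whose predictions are combined) can be sketched roughly as follows. This is a toy under stated assumptions, not the LSD implementation: the two "learners" are hand-written rules that mimic the learned hypotheses on the real-estate slide instead of trained classifiers, and the combination uses fixed weights where stacking would learn them from training data. The function names, regular expressions, and weights are all hypothetical.

```python
import re

# Elements of the mediated schema from the real-estate example.
MEDIATED = ["address", "price", "agent-phone", "description"]


def name_learner(column_name: str) -> dict[str, float]:
    """Score mediated elements using only the source column's name."""
    scores = {m: 0.0 for m in MEDIATED}
    name = column_name.lower()
    if "phone" in name:
        scores["agent-phone"] = 1.0
    if "price" in name:
        scores["price"] = 1.0
    if "location" in name or "address" in name:
        scores["address"] = 1.0
    return scores


def instance_learner(values: list[str]) -> dict[str, float]:
    """Score mediated elements by the fraction of data values that fit a pattern."""
    scores = {m: 0.0 for m in MEDIATED}
    if not values:
        return scores
    n = len(values)
    cue_words = {"fantastic", "great"}  # cue words taken from the slide
    scores["price"] = sum(bool(re.fullmatch(r"\$[\d,]+", v)) for v in values) / n
    scores["agent-phone"] = (
        sum(bool(re.fullmatch(r"\(\d{3}\) \d{3} \d{4}", v)) for v in values) / n
    )
    scores["address"] = (
        sum(bool(re.fullmatch(r"[A-Za-z .]+, [A-Z]{2}", v)) for v in values) / n
    )
    scores["description"] = (
        sum(any(w in cue_words for w in v.lower().split()) for v in values) / n
    )
    return scores


def match(column_name: str, values: list[str],
          w_name: float = 0.5, w_instance: float = 0.5) -> tuple[str, float]:
    """Combine base-learner scores with fixed weights (a stand-in for stacking)."""
    by_name = name_learner(column_name)
    by_instance = instance_learner(values)
    combined = {
        m: w_name * by_name[m] + w_instance * by_instance[m] for m in MEDIATED
    }
    best = max(combined, key=combined.get)
    return best, combined[best]


if __name__ == "__main__":
    homes = {
        "price": ["$550,000", "$320,000"],
        "contact-phone": ["(278) 345 7215", "(617) 335 2315"],
        "extra-info": ["Beautiful yard", "Great beach"],
    }
    for column, column_values in homes.items():
        print(column, "->", match(column, column_values))
```

Neither learner alone labels every homes.com column: the name rules say nothing about extra-info, and only the data values point it toward description. That is the motivation for combining evidence.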
  Classifier for every corpus element
  Learn general-purpose knowledge
  Reuse extracted knowledge to match new schemas
  [Figure: Corpus of Schemas and Matches, with example elements Music, Books, Authors, Authors, Items, Artists, Information, Publisher, Literature, CDs, Categories, Artists]

Exploiting The Corpus
  Given an element $s \in S$ and $t \in T$, how do we determine if s and t are similar?
  The PIVOT Method: elements are similar if they are similar to the same corpus concepts.
  The AUGMENT Method: enrich the knowledge about an element by exploiting similar elements in the corpus.

Pivot: measuring (dis)agreement
  Compute interpretations w.r.t. the corpus:
    $P_k = \Pr(s \sim c_k)$
    Interpretation $I(s) = \langle P_1, \dots, P_n \rangle$, where $n$ is the number of concepts in the corpus
  The interpretation captures how similar an element is to each corpus concept.
  Similarity(I(s), I(t)): interpretations are compared using cosine distance.
  [Figure: element s of schema S and element t of schema T, each mapped to its interpretation vector I(s), I(t)]

Augmenting element models
  Search similar corpus concepts:
    Pick the most similar ones from the interpretation.
  Build augmented models:
    Robust since more training data to learn from.
  Compare elements using the augmented models.
  [Figure: element s of schema S with element model M_s (Name, Instances, Type, …); similar elements e and f from the corpus of known schemas and mappings are used to build the augmented model M'_s]

Experimental Results
  Five domains:
    Auto and real estate: web forms
    Invsmall and inventory: relational schemas
    Nameaddr: real XML schemas
  Performance measure: F-measure
    $$ f = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
  Precision and recall are measured in terms of the matches predicted.
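The two measures mentioned above, the cosine comparison of interpretation vectors used by the Pivot method and the F-measure used in the experiments, can be sketched in a few lines of Python. This is a minimal illustration: the function names are hypothetical, matches are assumed to be represented as sets of (source element, target element) pairs, and the code computes cosine similarity (cosine distance would be one minus this value).

```python
import math


def cosine_similarity(i_s: list[float], i_t: list[float]) -> float:
    """Compare two interpretation vectors I(s) and I(t) of equal length."""
    dot = sum(a * b for a, b in zip(i_s, i_t))
    norm = math.sqrt(sum(a * a for a in i_s)) * math.sqrt(sum(b * b for b in i_t))
    # An all-zero interpretation carries no evidence; treat it as dissimilar.
    return dot / norm if norm else 0.0


def f_measure(predicted: set, correct: set) -> float:
    """F-measure over predicted matches: 2PR / (P + R)."""
    true_positives = len(predicted & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical interpretations over a corpus with three concepts.
    print(cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))

    # Hypothetical predicted and correct matches for one schema pair.
    predicted = {("phone", "agent-phone"), ("price", "price"), ("extra-info", "address")}
    correct = {("phone", "agent-phone"), ("price", "price"),
               ("extra-info", "description"), ("location", "address")}
    print(f_measure(predicted, correct))
```

In the usage example, two of the three predicted matches are correct (precision 2/3) and two of the four correct matches are found (recall 1/2), so the F-measure works out to 4/7.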

Comparison over domains
  [Bar chart: average F-measure (0 to 0.9) for the direct, augment, and pivot methods on the auto, real estate, invsmall, inventory, and nameaddr domains]
  Corpus-based techniques perform better in all the domains.

"Tough" schema pairs
  [Bar chart: average F-measure (0 to 0.9) for the direct, augment, and pivot methods on the auto, real estate, invsmall, inventory, and nameaddr domains]
  Significant improvement in difficult-to-match schema pairs.

Mixed corpus
  [Bar chart: average F-measure (0 to 0.9) for the direct, augment, and pivot methods; x-axis labels: auto + re + invsmall, difficult, auto + invsmall]
  A corpus with schemas from different domains can also be useful.

Other Corpus Based Tools
  A corpus of schemas can be the basis for many useful tools:
    Mirror the success of corpora in IR and NLP?
  Back to the structure chasm: authoring and querying.
    Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
    Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.

Conclusion
  Vision: data authoring, querying and sharing by everyone, everywhere.
  Need to make it easier to enjoy the benefits of structured data.
  Challenge: reconciling semantic heterogeneity
  [Figure labels: schema mapping, Corpus of schemas]

Some References
  www.cs.washington.edu/homes/alon
  Piazza: ICDE-03, WWW-03, VLDB-03
  The Structure Chasm: CIDR-03
  Surveys on schema matching languages:
    Halevy, VLDB Journal 01
    Lenzerini, PODS 2002
  Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01
  Teaching integration to undergraduates: SIGMOD Record, September 2003