CPSC 534A – Background
Rachel Pottinger
January 13 and 18, 2005

Administrative notes
- Please note that you're supposed to sign up for one paper presentation and one discussion, for different papers
- Please sign up for the mailing list
- WebCT has been populated; make sure you can access it
- HW 1 is on the web, due at the beginning of class a week from today

Overview of the next two classes
- Relational databases
- Entity Relationship (ER) diagrams
- Object Oriented Databases (OODBs)
- XML
- Other data types
- Database internals (briefly)
- An extremely brief introduction to category theory
Metadata management examples are interspersed.

Relational Database Basics
- What's in a relational database?
- Relational Algebra
- SQL
- Datalog

Relational Data Representation
A relation (or table) has attribute names (columns) and tuples (rows):

  PName        Price    Category     Manufacturer     <- attribute names or columns
  Gizmo        $19.99   Gadgets      GizmoWorks       <- tuples or rows
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

Relational schema representation
- Every attribute has an atomic type (e.g., Char, integer)
- Relation schema: the column headings, i.e., relation name + attribute names + attribute types
    Product(VarChar PName, real Price, VarChar Category, VarChar Manufacturer)
  Types are often left off:
    Product(PName, Price, Category, Manufacturer)
- Relation instance: the values in a table
- Database schema: a set of relation schemas in the database
- Database instance: a relation instance for every relation in the schema

Querying – Relational Algebra
- Select (σ): choose tuples from a relation
- Project (π): choose attributes from a relation
- Join (⋈): combine two relations
- Set-difference (−): tuples in relation 1, but not in relation 2
- Union (∪): tuples in relation 1 or in relation 2
- Cartesian product (×): each tuple of R1 paired with each tuple of R2

Find products where the manufacturer is GizmoWorks

  Product
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

Selection: σ_{Manufacturer = 'GizmoWorks'}(Product)

  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks

Find the name, price, and manufacturer of products whose price is greater than 100

Selection + Projection, over the same Product table: π_{PName, Price, Manufacturer}(σ_{Price > 100}(Product))

  PName        Price    Manufacturer
  SingleTouch  $149.99  Canon
  MultiTouch   $203.99  Hitachi

Find the names and prices of products that cost less than $200 and whose manufacturer matches the CName of a Company located in Japan

  Product
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

  Company
  Cname        StockPrice  Country
  GizmoWorks   25          USA
  Canon        65          Japan
  Hitachi      15          Japan

π_{PName, Price}((σ_{Price < 200}(Product)) ⋈_{Manufacturer = Cname} (σ_{Country = 'Japan'}(Company)))

  PName        Price
  SingleTouch  $149.99

When are two relations related?
- You guess they are
- I tell you so
- Constraints say so
  - A key is a set of attributes whose values are unique; we underline a key
      Product(PName, Price, Category, Manufacturer)
  - Foreign keys are a way for schema designers to tell you so
  - A foreign key states that an attribute is a reference to the key of another relation
      ex: Product.Manufacturer is a foreign key referencing Company
  - A foreign key both gives information and enforces a constraint

SQL
- Data Manipulation Language (DML)
  - Query one or more tables
  - Insert/delete/modify tuples in tables
- Data Definition Language (DDL)
  - Create/alter/delete tables and their attributes
- Transact-SQL
  - Idea: package a sequence of SQL statements for the server

Querying – SQL
- Standard language for querying and manipulating data
- SQL stands for Structured Query Language
- Many standards out there:
  - ANSI SQL
  - SQL92 (a.k.a. SQL2)
  - SQL99 (a.k.a. SQL3)
- Vendors support various subsets of these
- What we discuss is common to all of them

SQL basics
Basic form (there are many, many more bells and whistles in addition):
  SELECT attributes
  FROM relations (possibly multiple, joined)
  WHERE conditions (selections)

SQL – Selections
  SELECT *
  FROM Company
  WHERE country = 'Canada' AND stockPrice > 50

Some things allowed in the WHERE clause:
- attribute names of the relation(s) used in the FROM clause
- comparison operators: =, <>, <, >, <=, >=
- arithmetic operations: stockPrice*2
- operations on strings (e.g., "||" for concatenation)
- lexicographic order on strings
- pattern matching: s LIKE p
- special stuff for comparing dates and times
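The selection query and a few of the WHERE-clause features listed above can be tried out with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for whatever DBMS the course uses; the rows in `ROWS` and the helper names are invented for illustration, and only the Company schema (name, stockPrice, country) comes from the slides.

```python
import sqlite3

# Hypothetical rows, invented for illustration; only the Company schema
# (name, stockPrice, country) comes from the slides.
ROWS = [
    ("MapleTech", 75, "Canada"),
    ("NorthWind", 40, "Canada"),
    ("GizmoWorks", 25, "USA"),
    ("Canon", 65, "Japan"),
]


def make_db():
    """Create an in-memory database holding the example Company table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Company (name TEXT, stockPrice REAL, country TEXT)")
    conn.executemany("INSERT INTO Company VALUES (?, ?, ?)", ROWS)
    return conn


def canadian_over_50(conn):
    """The selection query from the slide, with single-quoted string literals."""
    sql = "SELECT * FROM Company WHERE country = 'Canada' AND stockPrice > 50"
    return conn.execute(sql).fetchall()


def where_clause_features(conn):
    """Arithmetic (stockPrice * 2), concatenation (||), and LIKE matching."""
    sql = """
        SELECT name || ' (' || country || ')', stockPrice * 2
        FROM Company
        WHERE name LIKE 'C%'
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(canadian_over_50(conn))
    print(where_clause_features(conn))
```

With these made-up rows, only MapleTech satisfies both conditions of the first query, and only Canon matches the pattern `'C%'` in the second.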
SQL – Projections
Select only a subset of the attributes:
  SELECT name, stockPrice
  FROM Company
  WHERE country = 'Canada' AND stockPrice > 50

Rename the attributes in the resulting table:
  SELECT name AS company, stockPrice AS price
  FROM Company
  WHERE country = 'Canada' AND stockPrice > 50

SQL – Joins
Schema:
  Product(name, price, category, maker)
  Purchase(buyer, seller, store, product)
  Company(name, stock price, country)
  Person(name, phone number, city)

  SELECT name, store
  FROM Person, Purchase
  WHERE name = buyer AND city = 'Vancouver' AND product = 'gizmo'

What's the SQL?
Selection: σ_{Manufacturer = 'GizmoWorks'}(Product)

  Product
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

  Result
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks

What's the SQL?
Selection + Projection, over the same Product table: π_{PName, Price, Manufacturer}(σ_{Price > 100}(Product))

  Result
  PName        Price    Manufacturer
  SingleTouch  $149.99  Canon
  MultiTouch   $203.99  Hitachi

What's the SQL?
π_{PName, Price}((σ_{Price <= 200}(Product)) ⋈_{Manufacturer = Cname} (σ_{Country = 'Japan'}(Company)))

  Company
  Cname        StockPrice  Country
  GizmoWorks   25          USA
  Canon        65          Japan
  Hitachi      15          Japan

  Result
  PName        Price
  SingleTouch  $149.99

More SQL – Outer Joins
What happens if there's no value available?
Product has one extra row (Foo) whose manufacturer has no matching Company row:

  Product
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi
  Foo          $1.99    Gadgets      Bar

  Company
  Cname        StockPrice  Country
  GizmoWorks   25          USA
  Canon        65          Japan
  Hitachi      15          Japan

Ordinary join (Foo does not appear in the result):
  SELECT PName, Country
  FROM Product, Company
  WHERE Manufacturer = Cname

Outer join (Foo is kept, with NULL for the missing value):
  SELECT PName, Country
  FROM Product LEFT OUTER JOIN Company ON Manufacturer = Cname

  PName        Country
  Gizmo        USA
  Powergizmo   USA
  SingleTouch  Japan
  MultiTouch   Japan
  Foo          NULL

Querying – Datalog
- Enables expressing recursive queries
- More convenient for analysis
- Some people find it easier to understand
- Without recursion but with negation, it is equivalent in power to relational algebra and SQL
- Limited version of Prolog (no functions)

Datalog Rules and Queries
A datalog rule has the following form:
  head :- atom1, atom2, …, atomN
You can read this as: then :- if ...

  ExpensiveProduct(N) :- Product(N,M,P) & P > $100
  CanadianProduct(N) :- Product(N,M,P) & Company(M, "Canada", SP)
  IntlProd(N) :- Product(N,M,P) & NOT Company(M, "Canada", SP)

Terminology, using the rules above:
- Head or IDB: the atom to the left of :-, e.g., ExpensiveProduct(N)
- Subgoal or EDB: a relation atom in the body, e.g., Product(N,M,P)
- Distinguished variables: variables that appear in the head, e.g., N
- Existential variables: variables that appear only in the body, e.g., M, P, SP
- Arithmetic comparison or interpreted predicate: P > $100
- Constant: "Canada"
- Negated subgoal: NOT Company(M, "Canada", SP), also denoted by ¬

Conjunctive Queries
- A subset of Datalog
- Only relations appear on the right-hand side of rules
- No negation
- Functionally equivalent to Select, Project, Join queries
- Very popular in modeling relationships between databases

What's the Datalog?
Selection: σ_{Manufacturer = 'GizmoWorks'}(Product)

  Product
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

  Result
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
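One possible reading of the selection above as a Datalog rule is Q(N, P, C, M) :- Product(N, P, C, M) & M = 'GizmoWorks'. The sketch below evaluates that rule, plus one join-style conjunctive query, directly over Python sets of tuples. This is only an illustration of how rule bodies map onto nested iteration; the function names are made up here, and a real Datalog engine would work quite differently.

```python
# Product(PName, Price, Category, Manufacturer) and
# Company(Cname, StockPrice, Country), copied from the slide tables.
PRODUCT = {
    ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
    ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "Photography", "Canon"),
    ("MultiTouch", 203.99, "Household", "Hitachi"),
}
COMPANY = {
    ("GizmoWorks", 25, "USA"),
    ("Canon", 65, "Japan"),
    ("Hitachi", 15, "Japan"),
}


def gizmoworks_products(product):
    """Q(N, P, C, M) :- Product(N, P, C, M) & M = 'GizmoWorks'."""
    return {(n, p, c, m) for (n, p, c, m) in product if m == "GizmoWorks"}


def products_made_in(product, company, country):
    """Q(N) :- Product(N, P, C, M) & Company(M, SP, country).

    N is distinguished (it appears in the head); P, C, M, and SP are
    existential. Using M in both subgoals is what expresses the join.
    """
    return {
        n
        for (n, _p, _c, m) in product
        for (cname, _sp, ctry) in company
        if m == cname and ctry == country
    }


if __name__ == "__main__":
    print(sorted(gizmoworks_products(PRODUCT)))
    print(sorted(products_made_in(PRODUCT, COMPANY, "Japan")))
```

Each subgoal becomes one loop over a relation, each shared variable becomes an equality test, and the head lists the values that are kept, which is why conjunctive queries line up with Select, Project, Join queries.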
What's the Datalog?
Selection + Projection: π_{PName, Price, Manufacturer}(σ_{Price > 100}(Product))

  Product
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi

  Result
  PName        Price    Manufacturer
  SingleTouch  $149.99  Canon
  MultiTouch   $203.99  Hitachi

What's the Datalog?
π_{PName, Price}((σ_{Price <= 200}(Product)) ⋈_{Manufacturer = Cname} (σ_{Country = 'Japan'}(Company)))

  Company
  Cname        StockPrice  Country
  GizmoWorks   25          USA
  Canon        65          Japan
  Hitachi      15          Japan

  Result
  PName        Price
  SingleTouch  $149.99

Bonus Relational Goodness: Views
- Views are relations, except that they are not physically stored. (Materialized views are stored.)
- They are used mostly to simplify complex queries and to define conceptually different views of the database for different classes of users.
- They are also used to model relationships between databases.

View: purchases of telephony products:
  CREATE VIEW telephony_purchases AS
  SELECT product, buyer, seller, store
  FROM Purchase, Product
  WHERE Purchase.product = Product.name
    AND Product.category = 'telephony'

Summarizing Relational DBs
- Relational perspective: data is stored in relations. Relations have attributes. Data instances are tuples.
- SQL perspective: data is stored in tables. Tables have columns. Data instances are rows.
- Query languages
  - Relational algebra: mathematical base for understanding query languages
  - SQL: very widely used
  - Datalog: based on Prolog, very popular with theoreticians
- Views allow complex queries to be written simply

Relational Metadata problems

Data Integration: Planning a Beach Vacation
[Figure: goals Beach, Good Weather, and Cheap Flight; sources Fodors, AAA, weather.com, wunderground, Expedia, and Orbitz]

Data Integration System Architecture
[Figure: a user query is posed against a virtual database with a mediated schema (e.g., "Airport"), which sits above Local Schema 1 … Local Schema N and Local Database 1 … Local Database N (e.g., Orbitz, Expedia)]

Data Translation
- Data exists in two different schemas.
- You have data in one, and you want to put data into the other.
- How are the schemas related to one another?
- How do you change the data from one to another?

Data Warehousing
Data warehouses store vast quantities of data for fast query processing, but with only batch updating.
- Import schemas of data sources
- Identify overlapping attributes, etc.
- Build data cleaning scripts
- Build data transformation scripts
- Enable data lineage tracing

Schema Evolution and Data Migration
Schemas change over time; data must change with them.
- How do we deal with schema changes?
- How can we make it easy for the data to migrate?
- How do we handle applications built on the old schema that store data in the new database?

Entity / Relationship Diagrams
- Entities (e.g., Product)
- Attributes (e.g., address)
- Relationships between entities (e.g., buys)

Keys in E/R Diagrams
Every entity set must have a key.
[Figure: entity set Product with attributes name, category, price]

[Figure: E/R diagram with entity sets Company (name, stockprice), Product (name, category, price), and Person (name, address, sin), and relationships makes, buys, and employs]

Multiplicity of E/R Relations
- one-one
- many-one
- many-many
[Figure: three mappings between {1, 2, 3} and {a, b, c, d}, one per multiplicity]

What does this say?
[Figure: the same Company / Product / Person diagram with relationships makes, buys, and employs]

Roles in Relationships
What if we need an entity set twice in one relationship?
[Figure: relationship Purchase connecting Product, Store, and Person, where Person participates twice in the roles buyer and salesperson]

Attributes on Relationships
[Figure: relationship Purchase among Product, Person, and Store, with attribute date on the relationship]

Subclasses in E/R Diagrams
[Figure: Product (name, category, price) with subclasses Software Product (platforms) and Educational Product (Age Group), each linked by isa]

Keys in E/R Diagrams
- Underline the key attributes
- No formal way to specify multiple keys in E/R diagrams
[Figure: Product (name, category, price) and Person (address, name, SIN)]

From E/R Diagrams to Relational Schema
- Entity set → relation
- Relationship → relation

Entity Set to Relation
[Figure: entity set Product with attributes name, category, price]
  Product(name, category, price)

  name   category  price
  gizmo  gadgets   $19.99

Relationships to Relations
[Figure: Company (name, Stock price) makes Product (name, category, price); makes has attribute Start Year]
  Makes(product-name, product-category, company-name, year)

  Product-name  Product-Category  Company-name  Starting-year
  gizmo         gadgets           gizmoWorks    1963

(watch out for attribute name conflicts)

Relationships to Relations
[Figure: the same Company makes Product diagram]
No need for Makes. Modify Product:

  name   category  price  StartYear  companyName
  gizmo  gadgets   19.99  1963       gizmoWorks

Multi-way Relationships to Relations
[Figure: relationship Purchase among Product (name, price), Store (name, address), and Person (sin, name)]
  Purchase( , , )

Summarizing ER diagrams
- Entities, relationships, and attributes
- Also has inheritance
- Used to design schemas; the relational schema is then derived from it

Metadata problems: Mapping ER to Relational
- Database design
- Map ER model to SQL schema
- Reverse engineer SQL schema to ER model

Metadata problems: Round Trip Engineering
- Design in ER
- Implement in relational
- Modify the relational schema
- How do we change the ER diagram?
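The "Entity Set to Relation" and "Relationships to Relations" rules from this section can be sketched in a few lines. The classes and function names below are hypothetical, and the choice of (name, category) as Product's key is inferred from the Makes(product-name, product-category, company-name, year) schema on the slide rather than stated there. The sketch covers only the plain "one relation per relationship" translation, not the variant that folds Makes into Product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySet:
    name: str
    attributes: tuple  # all attribute names
    key: tuple         # the attributes that form the key


@dataclass(frozen=True)
class Relationship:
    name: str
    participants: tuple      # EntitySet objects taking part
    attributes: tuple = ()   # attributes on the relationship itself


def entity_to_relation(e):
    """Entity set -> relation with the same attributes."""
    return f"{e.name}({', '.join(e.attributes)})"


def relationship_to_relation(r):
    """Relationship -> relation holding each participant's key attributes,
    prefixed with the entity name to avoid attribute name conflicts,
    followed by the relationship's own attributes."""
    cols = [f"{e.name.lower()}-{k}" for e in r.participants for k in e.key]
    cols += list(r.attributes)
    return f"{r.name}({', '.join(cols)})"


# Key for Product is an assumption inferred from the slide's Makes schema.
product = EntitySet("Product", ("name", "category", "price"), ("name", "category"))
company = EntitySet("Company", ("name", "stockprice"), ("name",))
makes = Relationship("Makes", (product, company), ("year",))

if __name__ == "__main__":
    print(entity_to_relation(product))
    print(relationship_to_relation(makes))
```

Prefixing each key attribute with its entity name is one simple way to handle the attribute name conflict the slide warns about, since both Product and Company have an attribute called name.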
View integration
- Define use-case scenario
- Identify views for each use-case
- Integrate views into a conceptual schema

CPSC 534A Background: Part 2
Rachel Pottinger
January 18, 2005

Administrative notes
- Please sign up for papers if you haven't already (if there's enough time, we'll do this at the end of class)
- Remember that the first reading responses are due 9pm Wednesday
- Mail me if you can't access WebCT
- Remember the 1st homework is due at the beginning of class Thursday
  - General theory: trying to make sure you understand the basics and have thought about them; not looking for the one true answer
  - State any assumptions you make
  - If you can't figure out a detail of how to transform ER to relational based on class discussion, write an explanation of what you did and why
- Any other questions?
- Office hours?

Object-Oriented DBMS's
- Started in the late 80's
- Main idea:
  - Toss the relational model!
  - Use the OO model, e.g., C++ classes
- Standards group: ODMG = Object Data Management Group
- OQL = Object Query Language, which tries to imitate SQL in an OO framework

The OO Plan
ODMG imagines OO-DBMS vendors implementing an OO language like C++ with extensions (OQL) that allow the programmer to transfer data between the database and the "host language" seamlessly.

A brief diversion: the impedance mismatch

OO Implementation Options
- Build a new database from scratch (O2)
  - Elegant extension of SQL
  - Later adopted by ODMG in the OQL language
  - Used to help build XML query languages
- Make a programming language persistent (ObjectStore)
  - No query language
  - Niche market
  - ObjectStore is still around, renamed to eXcelon; it stores XML objects now

ODL
- ODL is used to define persistent classes, those whose objects may be stored permanently in the database.
- ODL classes look like entity sets with binary relationships, plus methods.
- ODL class definitions are part of the extended, OO host language.

ODL – remind you of anything?
  interface Person (extent People key sin) {
    attribute string sin;
    attribute string dept;
    attribute string name;
  }
  interface Course (extent Crs key cid) {
    attribute string cid;
    attribute string cname;
    relationship Person instructor;
    relationship Set<Student> stds inverse takes;
  }
  interface Student extends Person (extent Students) {
    attribute string major;
    relationship Set<Course> takes inverse stds;
  }

Why did OO Fail?
- Why are relational databases so popular?
  - Very simple abstraction; you don't have to think about programming when storing data
  - Very well optimized
- Relational DBs are very well entrenched: OO offered not enough advantages, and no good exit strategy…
- Metadata failure

Merging Relational and OODBs
- Object-oriented models support interesting data types, not just flat files: maps, multimedia, etc.
- The relational model supports very-high-level queries.
- Object-relational databases are an attempt to get the best of both.
- All major commercial DBs today have OR versions; the full spec is in SQL99, but your mileage may vary.

XML
- eXtensible Markup Language
- XML 1.0: a recommendation from the W3C, 1998
- Roots: SGML (from the document community; works great for them, but from a DB perspective, very nasty).
- After the roots: a format for sharing data

Why XML is of Interest to Us
- XML is just syntax for data
  - Note: we have no syntax for relational data
  - But XML is not relational: it is semistructured
- This is exciting because we:
  - Can translate any data to XML
  - Can ship XML over the Web (HTTP)
  - Can input XML into any application
  - Thus: data sharing and exchange on the Web

XML Data Sharing and Exchange
[Figure: XML data shipped over the Web (HTTP) connects relational data, object-relational data, and legacy data to applications through Transform, Integrate, and Warehouse steps]
Think of all the metadata problems!

From HTML to XML
HTML describes the presentation:
  <h1> Bibliography </h1>
  <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
  <br> Addison Wesley, 1995
  <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
  <br> Morgan Kaufmann, 1999

XML describes the content:
  <bibliography>
    <book>
      <title> Foundations… </title>
      <author> Abiteboul </author>
      <author> Hull </author>
      <author> Vianu </author>
      <publisher> Addison Wesley </publisher>
      <year> 1995 </year>
    </book>
    …
  </bibliography>

XML Document
  <data>
    <person id="o555">
      <name> Mary </name>
      <address>
        <street> Maple </street>
        <no> 345 </no>
        <city> Seattle </city>
      </address>
    </person>
    <person>
      <name> John </name>
      <address> Thailand </address>
      <phone> 23456 </phone>
      <married/>
    </person>
  </data>
Callouts: id is an attribute; the two <person> blocks are person elements; the <name> blocks are name elements.

XML Terminology
- Elements
  - enclosed within tags: <person> … </person>
  - nested within other elements: <person> <address> … </address> </person>
  - can be empty: <married></married>, abbreviated as <married/>
  - can have attributes: <person id="0005"> … </person>
- An XML document has a single ROOT element

Buzzwords
What is XML?
- W3C data exchange format
- Hierarchical data model
- Self-describing
- Semi-structured

XML as a Tree !!
  <data>
    <person id="o555">
      <name> Mary </name>
      <address>
        <street> Maple </street>
        <no> 345 </no>
        <city> Seattle </city>
      </address>
    </person>
    <person>
      <name> John </name>
      <address> Thailand </address>
      <phone> 23456 </phone>
    </person>
  </data>

The corresponding tree (element nodes, with attribute and text nodes shown inline):
  data
    person (attribute id = o555)
      name: Mary
      address
        street: Maple
        no: 345
        city: Seattle
    person
      name: John
      address: Thai
      phone: 23456
Minor detail: order matters !!!

XML is self-describing
- Schema elements become part of the data
  - In XML, <persons>, <name>, and <phone> are part of the data, and are repeated many times
  - A relational schema, persons(name, phone), is defined separately from the data and is fixed
- Consequence: XML is much more flexible

Relational Data as XML
  persons
  name  phone
  John  3634
  Sue   6343
  Dick  6363

As a tree: persons has three person children, each with a name ("John", "Sue", "Dick") and a phone (3634, 6343, 6363).

As XML:
  <persons>
    <person> <name>John</name> <phone> 3634</phone> </person>
    <person> <name>Sue</name> <phone> 6343</phone> </person>
    <person> <name>Dick</name> <phone> 6363</phone> </person>
  </persons>

XML is semi-structured
Missing elements:
  <person> <name> John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>     (no phone!)
Could represent in a table with nulls:
  name  phone
  John  1234
  Joe   -

XML is semi-structured
Repeated elements:
  <person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person>     (two phones!)
Impossible in tables:
  name  phone
  Mary  2345  3456  ???

XML is semi-structured
Elements with different types in different objects:
  <person>
    <name> <first> John </first> <last> Smith </last> </name>     (structured name!)
    <phone>1234</phone>
  </person>
Heterogeneous collections: <persons> can contain both <person>s and <customer>s

Summarizing XML
- XML has first-class elements and second-class attributes
- XML is semi-structured
- XML is nested
- XML is a tree
- XML is a huge buzzword
Will XML replace relational databases?
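The "missing elements" and "repeated elements" cases from the semi-structured slides can be made concrete with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the combined document `DOC` and the helper names are put together here for illustration, reusing the John, Joe, and Mary examples from the slides.

```python
import xml.etree.ElementTree as ET

# The three <person> examples from the slides, gathered into one document.
DOC = """
<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</persons>
""".strip()


def phones_by_name(xml_text):
    """Map each person's name to the list of their phone numbers."""
    root = ET.fromstring(xml_text)
    return {
        person.findtext("name").strip(): [p.text.strip() for p in person.findall("phone")]
        for person in root.findall("person")
    }


def to_flat_rows(xml_text):
    """Flatten to (name, phone) rows: a missing phone becomes None (NULL),
    and repeated phones produce one row each."""
    rows = []
    for name, phones in phones_by_name(xml_text).items():
        if phones:
            rows.extend((name, phone) for phone in phones)
        else:
            rows.append((name, None))
    return rows


if __name__ == "__main__":
    print(phones_by_name(DOC))
    print(to_flat_rows(DOC))
```

The tree keeps zero, one, or many phone children per person without any schema change, while the flat (name, phone) encoding needs a NULL for Joe and two separate rows for Mary, since a single row cannot hold both of her phones.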
Other data formats
- Makefiles
- Forms
- Application code

Other Metadata Applications
- Message mapping
  - Map messages from one format to another
- Scientific data management
  - Merge schemas from related experiments
  - Manage transformations of experimental data
  - Track evolution of schemas and transformations
- DB application development
  - Map SQL schema to default form
  - Map business rules to SQL constraints and form validation code
  - Manage dependencies between code, schemas, and forms

How SQL Gets Executed: Query Execution Plans
  SELECT Pname, Price
  FROM Product, Company
  WHERE Manufacturer = Cname AND Price <= 200 AND Country = 'Japan'

Plan (operators applied bottom-up):
  π_{Pname, Price}
    σ_{Price <= 200 ∧ Country = 'Japan'}
      ⋈_{Manufacturer = Cname}
        Product    Company

Query optimization also specifies the algorithm for each operator; then queries can be executed.

Overview of Query Optimization
- Plan: tree of ordered relational algebra operators, with a choice of algorithm for each operator
- Two main issues:
  - For a given query, what plans are considered? (Algorithm to search the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
- Ideally: want to find the best plan. Practically: avoid the worst plans.
- Some tactics
  - Do selections early
  - Use materialized views

Query Execution
- Now that we have the plan, what do we do with it?
- How do we deal with paging in data, etc.?
- New research covers new paradigms where execution is interleaved with optimization

Transactions
Address two issues:
- Access by multiple users
  - Remember the "client-server" architecture: one server with many clients
- Protection against crashes

Transactions
Transaction = group of statements that must be executed atomically
Transaction properties: ACID
- ATOMICITY = all or nothing
- CONSISTENCY = leave the database in a consistent state
- ISOLATION = as if it were the only transaction in the system
- DURABILITY = store on disk!

Transactions in SQL
- In "ad-hoc" SQL:
  - Default: each statement = one transaction
- In "embedded" SQL:
    BEGIN TRANSACTION
      [SQL statements]
    COMMIT or ROLLBACK (= ABORT)

Transactions: Serializability
Serializability = the technical term for isolation
- An execution is serial if each transaction runs completely before or completely after any other transaction's execution
- An execution is serializable if it is equivalent to one that is serial
- A DBMS can offer serializability guarantees

Serializability
Enforced with locks, like in operating systems! But this is not enough:

  time   User 1          User 2
   |     LOCK A
   |     [write A=1]
   |     UNLOCK A
   |     ...             LOCK A
   |     ...             [write A=3]
   |     ...             UNLOCK A
   |     ...             LOCK B
   |                     [write B=4]
   |                     UNLOCK B
   |     LOCK B
   |     [write B=2]
   v     UNLOCK B

What is wrong?

Category Theory
"Category theory is a mathematical theory that deals in an abstract way with mathematical structures and relationships between them. It is half-jokingly known as 'generalized abstract nonsense'." [wikipedia]
There is a lot of scary category theory out there. You only need to know a few terms.

Category Theory
- Started in 1945
- General mathematical theory of structures and systems of structures.
- Reveals how structures of different kinds are related to one another, as well as the universal components of a family of structures of a given kind.
- It is considered by many to be an alternative to set theory as a foundation for mathematics.
- It is very, very, very abstract

Category Theory Definitions
- C is a graph.
- There are two classes: the objects or nodes obj(C) and the morphisms or edges or arrows mor(C).
- Any morphism f ∈ mor(C) has a source and a target object, f : a → b.
- For any composable pair of morphisms f : a → b and g : b → c, there is a composition morphism (g • f) : a → c.
  [Figure: triangle with f : a → b, g : b → c, and g • f : a → c]
- A functor translates objects and morphisms from one category to another
- Diagrams commute if one can follow any path through the diagram and obtain the same result by composition
  [Figure: square diagram on objects a, b, c, d]

Why you need a bit of category theory
- Lots of people like to use the term "morphism"
- It's the motivation behind a number of views; understanding this can make reading papers easier
- If you're theoretically minded, it can give you a good way to think about the problem

Overall background recap
- There are many different data models. We covered:
  - Relational Databases
  - Entity-Relationship Diagrams
  - Object Oriented Databases
  - XML
- Changing schemas within a data model creates metadata problems. So does changing schemas between data models.
- Databases have some (largely hidden) internal processes; some of these will come up in the papers we read
- Theory can be handy to ground your reading.

Now what?
- Time to read papers
- Prepare paper responses: it'll help you focus on the paper, and allow the discussion leader to prepare a better discussion
- You all have different backgrounds, interests, and insights. Bring them into class!