CS545 Intro
Why Databases?
September 2001
Gio Wiederhold
Stanford University
www-db.stanford.edu/people/gio.html
Sep-01 CS545 Intro 1

Abstract
The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and types of users. Databases have been available in various forms since 1958. The major paper defining database functionality in a formal sense is due to Ted Codd, of IBM, published in 1970.
Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The number of potential suppliers and their autonomy also create information overload. To cope with these issues, novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change.
The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics of diverse sources are captured by their ontologies, the collection of terms and their relationships as used in the domain of discourse for the source. When sources are to be related we rely on their ontologies to make the linkages. Creating a sound algebra encompassing the required operations allows manipulation and composition of the interoperation process.

Outline
• Motivation and Functions needed
• Early Inventions
• Architecture
• Formal basis
• Breadth of applicability
• Unsolved problems
• Research Directions

Files versus Databases
Files: provide input and output for a program (transient)
• Devices: Paper tape (ASCII), Cards, Magnetic Tapes
• Examples:
  1. FORTRAN: tapes 1-5 input, 5 standard in (80 column cards); tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols); still visible in files, IBM VM OS
  2. UNIX: standard in > standard out
  3. Data-processing: in > • > out = in > • > out = in > • > out ....
Databases: storage (persistent, reliable, random access)
• Enabled by disk technology, starting in 1960 (5MB)
• Many users, i.e., many (small) programs
• Example:
  1. BOMP – Bill-of-materials (inventory), airline seats, processing

Files
• Files: a means for programs to store data for later use
  – The initial program determines
    1. what data are being stored (all? – memory dump [LISP])
    2. how it is being stored – structure and format
    3. when it is being stored and available
  – successor programs must follow these decisions
    • often the successor program is another invocation of the initial program
• Problems
  – One program requires a different structure than another: BOMP
  – Data must be available rapidly, incrementally:
    • Class-assignments
    • seat reservations
    • library checkout
  – Programs must be available continuously, depend on data

Databases
• Data are intended to be used by many programs
  – Often small – transactions
  – Various subsets of all the relevant data
  – Structural transformations: Bill-of-Materials
Programs:
  Input program: records parts being delivered (Supplier :> parts)
  Output program: records parts being consumed (Products :> parts)
  Inventory: Suppliers, Products :> parts

BoMPs are common
• Supplier, Parts, Product-Assemblies
• Clinical-labs, Observations, Patient-Records
• Employees, Salary & Tasks, Productivity
• Accidents, Reports, Failure-Analysis
• Flights, Seats, Passengers
• Classes, Grades, Student-Performance
• ...
Two directions / hierarchies needed for data access: Data sources, Stuff, Data consumption
Solutions?

Design Problem & Solutions
Conceptual model
• Supplier program:
  – Use a hierarchy: supplier to parts supplied (1:n)
• Consumer program:
  – Use a hierarchy: consumer to parts used (1:m)
Actual solution in memory:
  Matrix: if it exceeds memory, then either supplier or consumer part accesses become costly
Actual solution beyond memory:
  1. redundant transformed data
  2. pointer and index structures
[Figure: matrix relating suppliers s1, s2, s3, ..., sn and consumers c1, c2, c3, ..., cm through parts P]

Factors influencing design
• Size: memories are getting bigger, problems too
• Density of matrix:
  – suppliers supply only some parts, overlapping
  – products consume only some parts, overlapping
• Performance requirements:
  – supplier response can be less critical
  – airline seats made available versus seats being sold
  – laboratory data obtained versus patient records needed
• Usage patterns:
  – batches versus single item accesses
  – linked according to yet other criteria

DBMSs
Database Management Systems
• Collection of the software needed to manage databases
• Components:
  – Storage management – intertwined with the operating systems
  – Query and update processor – uses the schema
  – Schema interpreter and compiler
  – Transaction management and concurrency control/protection – also jointly with OS
  – Logger for backup
  – Recovery programs
• Large, complex, not all features always needed
• Many fewer vendors now than 10 years ago

Inventions – 1 – Data Description
• Schemas [McGee, 1958] program independence
  = A symbolic description of each column, to be interpreted by update and retrieval programs as well as users
  – Allows programs to use subsets
  – Allows columns to be added without affecting current programs
• Compilation of Schemas [1975]
  = avoids interpretation cost
  – requires keeping track of last update for auto-recompile
• Views [Chamberlin et al., 1976] Bounded schemas
  = Database administrator defines schema subset for user roles
  – Can be compiled for fast execution
  – Must be recompiled when base schema or view is changed.
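
Illustration (not from the original slides): a minimal sketch of the two schema properties listed above, program independence and views, using Python's standard sqlite3 module. The table, column names, and the "low stock" role are hypothetical, chosen only to echo the bill-of-materials example.

```python
import sqlite3

# Hypothetical inventory schema, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parts (part_no INTEGER PRIMARY KEY, name TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)", [(1, "bolt", 500), (2, "gear", 40)]
)


def stock_report(db):
    """An 'old' program: it names only the columns it needs (a subset)."""
    return db.execute("SELECT name, qty FROM parts ORDER BY part_no").fetchall()


before = stock_report(conn)

# A column is added later. The old program still works unchanged, because
# it refers to columns symbolically through the schema, not by position.
conn.execute("ALTER TABLE parts ADD COLUMN supplier TEXT")
conn.execute("UPDATE parts SET supplier = 'Acme' WHERE part_no = 1")
after = stock_report(conn)

# A view: a bounded schema defined once for a (hypothetical) purchasing role.
conn.execute(
    "CREATE VIEW low_stock AS SELECT part_no, name FROM parts WHERE qty < 100"
)
low = conn.execute("SELECT * FROM low_stock").fetchall()

print(before == after, low)
```

The report program sees the same rows before and after the schema change, and the view exposes only the columns and rows intended for its role.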

Inventions – 2 – access trees
• Indexes [Landauer, 1963] balanced trees
  = Efficient ancillary access path
  – Requires updating to stay current
• Multiple Indexes [Davis and Lin, 1965] multi-attribute-based access
  = Multiple ancillary access paths
  – Allows access by multiple paths
  – Requires much updating to stay current
• B-trees [Bayer, 1972] Index Updateability
  = Index blocks are kept only 50%-100% full for mostly fast update
  – Improves performance greatly for indexes

Inventions – 3 – structures
• Hierarchical Structures [IMS, 1963] Dense data structures
  = Trees mapped to sequential structures for fast access to sparse data
  – Fast access when many related values are needed
  – Costly to update, often done periodically
  – Must be combined with trees for multiple-access paths
• Triple storage [Feldman, 1969] Arbitrary structures
  = All data represented by object-attribute-value entries
  – High cost when many related values are needed
Note that these two conflict – in today's database implementations performance has won out over flexibility

Inventions – 4 – model foodfight
• Relational Model [Codd, 1970]
  = tabular model, with an algebraic set of operations, normalization
  – Formalization enabled understanding, dissemination
  – No inter-relation semantics, specified when query is made
  – Later constraints were added, implicitly defining keys, connections
• Hierarchical (also applied to one view of BOMPs)
  = describe hierarchical connections among data records, no algebra
  – An attempt to describe earlier, simple implementations in model terms
• Network – generalization of BOMP
  = describe structure, procedural navigation in near-arbitrarily linked data
  – Strong inter-record connections, needed for locating data

Why did the relational model win?
• Relational Model DBMSes: Sequel, QUEL, SQL
  – Formality – allowed essential optimization algorithms
  – Restrictions – such as normalization, provide guidance
  – Teachability – exposed principles:
    • can't teach only from examples
  – DBMS independence – safety blanket for mission-critical users
    • But implementations added features
    • Use least common set of features?
      – Hard to enforce once a system has been bought
    • Few suppliers remain {ORACLE, IBM, MS, mySQL}
• ER model [Chen, 1976]
  = Focuses on design, can be mapped to multiple implementations
  – Few tools for direct translation
  – Poor maintenance of model, ignored when DBs are expanded

Databases and the Web
• HTML presentation: HyperText Markup Language
  = Data are transformed for human consumption, external refs
  – Often hierarchical – object-oriented view
  – If there was a schema, it is now hidden
• XML presentation
  = Schema data is embedded
  – Much flexibility
  – Much more space when entries are small
  – Requires an interpretation for viewing, such as XSLT
• RDF: Resource Description Framework
  = Triple representation: object-attribute-value
  – Great flexibility
  – Uncertain implementation

Information overload, Data starvation
• More databases – public & corporate
• Faster communication – digital – packeting: TCP-IP, ATM
• World-wide connectivity – Internet & Intranets – world-wide web
• Disintermediation – ubiquitous publishing

Change in Supply vs Demand
What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.
[Herbert Simon]

Making data relevant
• Data reduction
• Data abstraction
  – Level changing
  – Summarization
  – Exception search
  – Level change to integrate with other data sources
• Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm

Data and Knowledge
[Figure: a Data Loop and a Knowledge Loop, with labels Storage, Education, Selection, Abstraction, Integration, Recording, Summarization, Experience, Decision-making, State changes, Action]
Information is created at the confluence of data (the state) and knowledge (the ability to select and project the state into the future)

Transforming Data to Information
Application Layer: users at workstations
Mediation Layer: value-added services
Foundation Layer: data and simulation resources

Functions inside Mediation
[Figure labels: articulation, Summarize, Transform, Selection; Heterogeneous resources]

Function of Mediation
Apply Domain-specific Specialist Knowledge to add value
• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation
INFORMATION

Environmental Restoration at INEL
Undoing 50 years of messes ….
[Figure: mediators and wrappers exchanging OEM / QEM over CORBA, with other mediators; query languages MSL [Stanford], OQL [ODMG], MQL [ISX]; many projects, many sources, including ERIS and IEDMS]
LOCKHEED MARTIN, June 1998
ISX - Stanford Univ. Idaho National Engineering Laboratory

From Schemas to Ontologies
Ontologies allow communication among partners in enterprises (rarely in machine-readable form)
Relationships determine meaning - parent, school, company
Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas.
Variable and Class names in Software
Knowledge-bases use term ontologies (often explicitly), add class definitions (to hold instances), constraints, and operations among the terms.

Ontology: components
We represent the contents and structure of a language by its ontology:
• a set of well-defined terms, which delimit the domain of discourse
• relationships among those terms, chosen from a limited set
a formalizable subset of expert knowledge

Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency,
  – Local Needs have Priority,
  – Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontologies

Unsolved problem in Interoperation
Common assumption in assembling and integrating distributed information resources
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
  1. naïve -- the goal cannot be achieved
  2. inefficient -- languages are efficient in local contexts

Large Ontologies: good or bad?
Have all the Knowledge together
  + simple for customers of KBs
  – hard for owners of KBs, must synchronize with many others
  – in the limit -- everybody must be globally consistent
Large KB will cover multiple / all domains
  • created by a committee -- slow
  • maintained by a committee -- costly to impossible
Differences in level of abstraction -- efficiency
  • homeowner: nail
  • carpenter: sinker, brad, boxnail, . . .

Evolution of mediation applications
[Figure: layered diagram, with parts labeled a. to e., of applications A1 to A6, integrators I1 and I2, mediators M1 and M2, wrappers W1 to W3, and datasources D1 to D6, connected over a network]

Definition*
A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications. It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.
* Wiederhold: IEEE Computer March 1992

Interfaces
Human / Computer {x-widgets, HTML}
Application / Mediator {OQL, KQML, ...}
Mediator / Data sources {SQL, TQL, XML, … }
Data / real world {sensors, clerks, … }

An Integration Architecture
[Figure: Client Application (portfolios for each company); Mediator; two Wrappers; sources: stock market prices (Ticker Tape) and business reports (Dialog)]

Status of Mediation Technology
Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update
Future
• Generated from models
• Domain Expert maintains models
• Specification determines functions
• Resource changes trigger regeneration

A mediator is not static software:
Knowledge ages
Application Interface
Changes of user needs
Software & People
Models, programs, rules, caches, . . .
Owner / Creator, Maintainer, Lessor - Seller, Advertiser
Resource changes
Resource Interfaces
Domain changes

Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software

Roles
Computer Scientists
• Provide tools
  – adaptation
  – integration
  – matching
  – composing
• Assess Standards
• Assure scalability
Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback
• Provide feedback

Mediation Research Topics
• Mediator management and maintenance
• Representation of knowledge and customer models
• Balancing dynamic and warehouse solutions
• Formalization of semantic heterogeneities
  – many levels and types
  – roles for wrappers vs. mediators vs. applications
  – scalability by partitioning -- make it simple!
  – Domain Ontologies -- tools, validation, . . .
• Effect of object paradigm and method-based access
• Service and business models
• New types of information systems

Long Range Science Vision
Databases: access, storage, algebras
Systems Engineering: analysis, documentation, costing
Artificial Intelligence: knowledge mgmt, domain expertise, uncertainty
Integration Methods
Integration Science
GIS

Fat versus thin mediators
service scope
• too thin: insufficient added value
• too fat: hard to compose
domain scope
• too narrow: few customers
• too broad: hard to maintain, needs a committee
Just right

Maintenance is good for you ?
[Chart: lifetime in years (0 to 13) and relative annual maintenance cost (10% to 100%) for automobile, hardware, and software; depreciation = 1 / lifetime]

Client-Server Architecture
[Figure: client systems drawing on shared data and simulation resources, with a change marked X]
Fast build of clients by resource reuse
Changes (X) are difficult, can affect many clients

Systems with Mediators
Gio Wiederhold. 1995
[Figure: three layers: Applications, Mediators, Data Resources]

Growth through Reuse
Gio Wiederhold. 1995
[Figure: New Application; Prior & Revised Mediators; Extended Data Resources]

Linear O(n) Cost of Growth -- now O(n^2)
• Data changes only affect some mediators; only in their domain
• Mediators can
  1. supply old information to n-1 prior applications
  2. provide better information to the new application
  3. be partially or completely reused
• New applications, using the new data, can be developed and inserted dynamically

Assigning maintenance responsibility
Sources
  a. Source data quality – supplier database, files, or web pages
  b. Interface to the source – wrapper, supplier or vendor for supplier
Services
  c. Source selection – expert specialist in mediator
  d. Source quality assessment – customer input to mediator
  e. Semantic interoperation – specialist group providing input to the mediator
  f. Consistency and metadata information – mediator service operation or warehouse
Customers
  g. Informal, pragmatic integration – client services with customer input
  h. User presentation formats – client services with customer input
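
Illustration (not from the original slides): a minimal sketch of how the responsibilities listed above might be split between sources, wrappers, and a mediator. All data, field names, and function names are hypothetical, and placing unit conversion in the wrapper rather than the mediator is an assumption made only to keep the example short.

```python
from statistics import mean

# a. Source data: two hypothetical suppliers with different names and units.
SOURCE_A = [{"part": "bolt", "price_usd": 0.10}, {"part": "gear", "price_usd": 4.00}]
SOURCE_B = [{"item": "bolt", "price_cents": 12}, {"item": "gear", "price_cents": 380}]


def wrap_a(rows):
    """b. Interface to source A: map its fields to a common form."""
    return [{"part": r["part"], "price": r["price_usd"]} for r in rows]


def wrap_b(rows):
    """b. Interface to source B: rename fields, convert cents to dollars."""
    return [{"part": r["item"], "price": r["price_cents"] / 100} for r in rows]


def mediate(*wrapped_sources):
    """Mediator: integrate the wrapped sources and summarize per part."""
    prices = {}
    for rows in wrapped_sources:
        for r in rows:
            prices.setdefault(r["part"], []).append(r["price"])
    # Summarize for the application layer: average price per part.
    return {part: round(mean(vals), 2) for part, vals in prices.items()}


# g./h. A client application only sees the mediated summary.
summary = mediate(wrap_a(SOURCE_A), wrap_b(SOURCE_B))
print(summary)
```

If source B changes its field names or units, only `wrap_b` needs an update; the mediator and the client are untouched, which is the maintenance argument the slides make for layering.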