Database Management Systems

What Is a Database System?
• Database: a very large, integrated collection of logically related data.
• Models a real-world enterprise
  – Entities (e.g., teams, games)
  – Relationships (e.g., The Forty-Niners are playing in the Super Bowl)
  – More recently, also includes active components, often called “business logic” (e.g., the BCS ranking system)
• A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases.
• Database System: DBMS + data (+ applications)

Database Applications
Other examples of database applications include:
• Purchases from the supermarket.
• Purchasing using your credit card.
• Booking a holiday at the travel agent.
• Using the local library.
• Using the internet.
• Studying at university.

Database Systems: Then (figure)

Database Systems: Today (figure)

Other databases you may use (figure, from the Friendster.com on-line tour)

Is the WWW a DBMS?
• Fairly sophisticated search available
  – crawler indexes pages on the web
  – keyword-based search for pages
• But, currently
  – data is mostly unstructured and untyped
  – search only: can’t modify the data; can’t get summaries or complex combinations of data
  – few guarantees provided for freshness of data, consistency across data items, fault tolerance, …
  – Web sites typically have a DBMS in the background to provide these functions.
• The picture is changing
  – New standards, e.g., XML and the Semantic Web, can help data modeling.
  – Research groups (e.g., at Berkeley) are working on providing some of this functionality across multiple web sites.

“Search” vs. Query
• What if you wanted to find out which actors donated to John Kerry’s presidential campaign?
• Try “actors donated to john kerry” in your favorite search engine.

A “Database Query” Approach (figure)

Is a File System a DBMS?
• Thought Experiment 1:
  – You and your project partner are editing the same file.
  – You both save it at the same time.
  – Whose changes survive?
  A) Yours B) Partner’s C) Both D) Neither E) ???
• Thought Experiment 2:
  – You’re updating a file.
  – The power goes out.
  – Which of your changes survive?
  A) All B) None C) All Since Last Save D) ???
• Q: How do you write programs over a subsystem when it promises you only “???”?
  A: Very, very carefully!!

Traditional File Processing System
• A file-based system is a collection of application programs that perform services for the end users, such as the production of reports. Each program defines and manages its own data.
• An early attempt to computerize the manual filing system:
  – Files in cabinets, with locks for security.
  – For searching, an indexing system may help locate what we want quickly.
• Works well
  – while the number of items to be stored is small;
  – when a large number of items only has to be stored and retrieved.

Traditional File Processing System
• The manual system becomes inefficient when the information in the files has to be processed, not just stored and retrieved.
• A typical real estate agent’s office holds two separate files:
  – A file for each property for sale or rent.
  – A file for each buyer and renter, and each member of staff.

Traditional File Processing System (figure: Contract Department, Sales Department)

Traditional File Processing System
Consider the effort that would be required to answer the following questions:
• What three-bedroom properties do you have for sale with a garden and garage?
• What flats do you have for rent within three miles of the city center?
• What is the average rent for a two-bedroom flat?
• What is the total annual salary bill for staff?
• How does last month’s turnover compare with the projected figure for this month?
• What is the expected monthly turnover for the next financial year?

File-Based Approach
• The file-based system was developed in response to the needs of industry for more efficient data access.
• Based on a decentralized approach, where each department, with the assistance of Data Processing (DP) staff, stored and controlled its own data.
• Consider the DreamHome example.
File-Based Approach (figure)

File-Based Approach
• Significant amount of duplication of data.
• Before discussing the limitations, it is useful to understand the terminology used in file-based systems.
• A file is simply a collection of records, which contain logically related data. For example, the PropertyForRent file contains six records, one for each property.
• Each record contains a logically connected set of one or more fields.
• Each field represents some characteristic of the real-world object that is being modeled.

Limitations of File-Based Approach
• Separation and isolation of data
  – Each program maintains its own set of data.
  – Users of one program may be unaware of potentially useful data held by other programs.
• Duplication of data
  – Results from the decentralized approach taken by each department.
  – The same data is held by different programs.
  – Wastes space, money and time and, perhaps more importantly, can compromise data integrity; in other words, the data may no longer be consistent.

Limitations of File-Based Approach
• Data dependence
  – File structure is defined in the program code.
  – Also known as program-data dependence.
• Incompatible file formats
  – Programs are written in different languages, and so cannot easily access each other’s files.
• Fixed queries / proliferation of application programs
  – Programs are written to satisfy particular functions.
  – Any new requirement needs a new program.

Database Approach
The limitations of the file-based approach listed above come down to two factors:
• The definition of the data is embedded in the application programs, rather than being stored separately and independently.
• There is no control over the access and manipulation of data beyond that imposed by the application programs.
These limitations were overcome by a new approach, the database approach (database and DBMS).

Current Commercial Outlook
• A major part of the software industry:
  – Oracle, IBM, Microsoft, Sybase
  – also Informix (now IBM), Teradata
  – smaller players: Java-based DBMS, devices, OO, …
• Well-known benchmarks (esp.
TPC)
• Lots of related industries
  – data warehouse, document management, storage, backup, reporting, business intelligence, app integration
• Relational products dominant and evolving
  – adapting for extensibility (user-defined types), adding native XML support
• Open Source coming on strong
  – MySQL, PostgreSQL, BerkeleyDB

Why Study Databases?
• Shift from computation to information
  – always true for corporate computing
  – Web made this point for personal computing
  – more and more true for scientific computing
• Need for DBMS has exploded in recent years
  – Corporate: retail swipe/clickstreams, “customer relationship mgmt”, “supply chain mgmt”, “data warehouses”, etc.
  – Scientific: digital libraries, Human Genome project, NASA Mission to Planet Earth, physical sensors, grid physics network
• DBMS encompasses much of CS in a practical discipline
  – OS, languages, theory, AI, multimedia, logic
  – Yet traditional focus on real-world apps

What’s the intellectual content?
• representing information
  – data modeling
• languages and systems for querying data
  – complex queries with real semantics*
  – over massive data sets
• concurrency control for data manipulation
  – controlling concurrent access
  – ensuring transactional semantics
• reliable data storage
  – maintain data semantics even if you pull the plug
* semantics: the meaning or relationship of meanings of a sign or set of signs

Files vs. DBMS
When data is kept in plain files, the application itself has to take care of:
• staging large datasets between main memory and secondary storage (e.g., buffering, page-oriented access, 32-bit addressing, etc.)
• special code for different queries
• protecting data from inconsistency due to multiple concurrent users
• crash recovery
• security and access control

Why Databases?
Why not store everything in flat files: use the file system of the OS, cheap/simple…

Name, Course, Grade
John Smith, CS112, B
Mike Stonebraker, CS234, A
Jim Gray, CS560, A
John Smith, CS560, B+
…

Yes, but not scalable…

Problem 1
Data redundancy and inconsistency
• Multiple file formats, duplication of information in different files

Name, Course, Email, Grade
John Smith, js@cs.bu.edu, CS112, B
Mike Stonebraker, ms@cs.bu.edu, CS234, A
Jim Gray, CS560, jg@cs.bu.edu, A
John Smith, CS560, js@cs.bu.edu, B+

Why is this a problem?
• Wasted space
• Potential inconsistencies (multiple formats, John Smith vs Smith J.)

Problem 2
Data retrieval:
• Find the students who took CSE
• Find the students with Percentage > 50
For every query we need to write a program!
We need the retrieval to be:
• Easy to write
• Efficient to execute

Problem 3
Data integrity
• No support for sharing: nothing prevents simultaneous modifications
• No coping mechanisms for system crashes
• No means of preventing data entry errors (checks must be hard-coded in the programs)
• Security problems
Database systems offer solutions to all the above problems.

Benefits of the Database Approach
• The data can be shared
• Redundancy can be reduced
• Inconsistency can be avoided
• Transaction support can be provided
• Integrity can be maintained
• Security can be enforced
• Conflicting requirements can be balanced
• Standards can be enforced

Data Organization
• Physical level (also internal level or storage view): describes how a record (e.g., customer) is stored.
• Conceptual level (also logical level or community user view): describes the data stored in the database, and the relationships among the data, in terms of the data model of the DBMS.

    type customer = record
        name : string;
        street : string;
        city : integer;
    end;

• External (view) level: application programs hide details of data types. Views can also hide information (e.g., salary) for security purposes.
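The view level can be illustrated with a small sketch. It assumes Python's standard sqlite3 module and an in-memory database; the employee table, its rows, and the salary figures are hypothetical and only echo the customer record above.

```python
import sqlite3

# Throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Logical level: the full relation, including a sensitive column.
    -- Table name and data are hypothetical, for illustration only.
    CREATE TABLE employee (
        name   TEXT,
        street TEXT,
        city   TEXT,
        salary INTEGER
    );
    INSERT INTO employee VALUES ('Johnson', 'Alma',  'Palo Alto', 90000);
    INSERT INTO employee VALUES ('Smith',   'North', 'Rye',       75000);

    -- View level: the same rows with the salary column left out.
    CREATE VIEW employee_public AS
        SELECT name, street, city FROM employee;
""")

cursor = conn.execute("SELECT * FROM employee_public ORDER BY name")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()

print(view_columns)  # column names visible through the view
print(view_rows)     # rows as seen through the view
```

A program that only queries employee_public never sees the salary column. SQLite has no user accounts, so this sketch shows only the hiding; in a multi-user DBMS one would typically also restrict access to the underlying table.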
View of Data
A logical architecture for a database system (figure)

Example (figure)

Levels of Abstraction
• Views describe how users see the data.
• Conceptual schema defines logical structure.
• Physical schema describes the files and indexes used.
(sometimes called the ANSI/SPARC model)
(Diagram: users access View 1, View 2 and View 3, which sit on top of the conceptual schema, the physical schema, and the DB.)

Example: University Database
• Conceptual schema:
  – Students(sid: string, name: string, login: string, age: integer, gpa: real)
  – Courses(cid: string, cname: string, credits: integer)
  – Enrolled(sid: string, cid: string, grade: string)
• External schema (view):
  – Course_info(cid: string, enrollment: integer)
• Physical schema:
  – Relations stored as unordered files.
  – Index on first column of Students.

Data Independence
• Applications insulated from how data is structured and stored.
• Logical data independence: protection from changes in logical structure of data.
• Physical data independence: protection from changes in physical structure of data.
Q: Why are these particularly important for DBMS?

DB Architecture (figure)

Database Schema
• Similar to types and variables in programming languages
• Schema – the structure of the database
  – e.g., the database consists of information about a set of customers and accounts and the relationship between them
  – Analogous to type information of a variable in a program
  – Physical schema: database design at the physical level
  – Logical schema: database design at the logical level
• Instance – the actual content of the database at a particular point in time
  – Analogous to the value of a variable

Data Models
Data Model: A set of concepts to describe the structure of a database, and certain constraints that the database should obey.
Data Models: a framework for describing
• data
• data relationships
• data semantics
• data constraints
Data Model Operations: Operations for specifying database retrievals and updates by referring to the concepts of the data model. Operations on the data model may include basic operations and user-defined operations.

Categories of data models
• Conceptual (high-level, semantic) data models: Provide concepts that are close to the way many users perceive data. (Also called entity-based or object-based data models.)
• Physical (low-level, internal) data models: Provide concepts that describe details of how data is stored in the computer.
• Implementation (representational) data models: Provide concepts that fall between the above two, balancing user views with some computer storage details.

History of Data Models
• Relational Model: proposed in 1970 by E.F. Codd (IBM); first commercial system in 1981-82. Now in several commercial products (DB2, ORACLE, SQL Server, SYBASE, INFORMIX).
• Network Model: the first one to be implemented, by Honeywell in 1964-65 (IDS System). Adopted heavily due to the support by CODASYL (CODASYL - DBTG report of 1971). Later implemented in a large variety of systems: IDMS (Cullinet, now CA), DMS 1100 (Unisys), IMAGE (H.P.), VAX-DBMS (Digital Equipment Corp.).
• Hierarchical Data Model: implemented in a joint effort by IBM and North American Rockwell around 1965. Resulted in the IMS family of systems. The most popular model. Other systems based on this model: System 2k (SAS Inc.)

History of Data Models
• Object-oriented Data Model(s): several models have been proposed for implementing in a database system. One set comprises models of persistent O-O programming languages such as C++ (e.g., in OBJECTSTORE or VERSANT) and Smalltalk (e.g., in GEMSTONE). Additionally, systems like O2, ORION (at MCC, then ITASCA), IRIS (at H.P., used in Open OODB).
• Object-Relational Models: most recent trend. Started with Informix Universal Server.
Exemplified in the latest versions of Oracle-10i, DB2, SQL Server, and other systems.

Hierarchical Model (figure)

Hierarchical Model
• Advantages:
  – Hierarchical Model is simple to construct and operate on
  – Corresponds to a number of natural hierarchically organized domains, e.g., assemblies in manufacturing, personnel organization in companies
  – Language is simple; uses constructs like GET, GET UNIQUE, GET NEXT, GET NEXT WITHIN PARENT, etc.
• Disadvantages:
  – Navigational and procedural nature of processing
  – Database is visualized as a linear arrangement of records
  – Little scope for "query optimization"

Network Model (figure)

Network Model
• Advantages:
  – Network Model is able to model complex relationships and represents semantics of add/delete on the relationships.
  – Can handle most situations for modeling using record types and relationship types.
  – Language is navigational; uses constructs like FIND, FIND member, FIND owner, FIND NEXT within set, GET, etc. Programmers can do optimal navigation through the database.
• Disadvantages:
  – Navigational and procedural nature of processing
  – Database contains a complex array of pointers that thread through a set of records.
  – Little scope for automated "query optimization"

Entity-Relationship Model
Example of schema in the entity-relationship model (figure)

Entity Relationship Model (Cont.)
• E-R model of real world
  – Entities (objects). E.g. customers, accounts, bank branch
  – Relationships between entities. E.g.
Account A-101 is held by customer Johnson
  – Relationship set depositor associates customers with accounts
• Widely used for database design
  – Database design in E-R model usually converted to design in the relational model (coming up next), which is used for storage and processing

Relational Model
Example of tabular data in the relational model (the column headings are the attributes):

customer-id   customer-name   customer-street   customer-city   account-number
192-83-7465   Johnson         Alma              Palo Alto       A-101
019-28-3746   Smith           North             Rye             A-215
192-83-7465   Johnson         Alma              Palo Alto       A-201
321-12-3123   Jones           Main              Harrison        A-217
019-28-3746   Smith           North             Rye             A-201

Data Independence
• Applications insulated from how data is structured and stored.
• Logical data independence: protection from changes in logical structure of data.
• Physical data independence: protection from changes in physical structure of data.
One of the most important benefits of using a DBMS!

Data Storage
• Where can data be stored?
  – Main memory
  – Secondary memory (hard disks)
  – Optical storage (DVDs)
  – Tertiary store (tapes)
• Move data? Determined by buffer manager
• Mapping data to files? Determined by file manager

Storage Management
• Storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
• The storage manager is responsible for the following tasks:
  – Interaction with the OS file manager
  – Efficient storing, retrieving and updating of data
• Issues:
  – Storage access
  – File organization
  – Indexing and hashing

Database Architecture (data organization)
(Diagram components: DBA, DDL commands, DDL interpreter, file manager, buffer manager, storage manager, secondary storage holding data, metadata and schema.)

Data retrieval
Queries
• Query = declarative data retrieval: describes what data, not how to retrieve it
• Ex. “Give me the students with GPA > 3.5” vs. “Scan the student file and retrieve the records with gpa > 3.5”
• Why?
  1. Easier to write
  2. Efficient to execute (why?)
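The contrast between the two phrasings of the GPA example can be sketched as follows. This is only an illustration using Python's standard sqlite3 module; the student rows, the students table, and its column names are hypothetical.

```python
import sqlite3

# Hypothetical student records: (sid, name, gpa).
students = [
    (101, "Ann", 3.9),
    (102, "Bob", 3.2),
    (103, "Cho", 3.7),
]

# Procedural: we spell out HOW to find the answer (scan, test, collect).
procedural = []
for sid, name, gpa in students:
    if gpa > 3.5:
        procedural.append(name)

# Declarative: we state WHAT we want; the DBMS decides how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sid INTEGER, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", students)
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM students WHERE gpa > 3.5 ORDER BY sid"
    )
]

print(procedural)
print(declarative)
```

Both versions return the same names. The difference is that the loop fixes one access strategy in application code, while the SELECT leaves the system free to pick a scan, an index lookup, or another plan without the query text changing.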
Data retrieval
(Diagram components: query, query processor containing the query optimizer and the query evaluator, plan, data.)
• Query Optimizer: “compiler” for queries (aka “DML Compiler”)
• Plan ~ assembly language program
• Optimizer does better with declarative queries:
  1. Algorithmic query (e.g., in C): 1 plan to choose from
  2. Declarative query (e.g., in SQL): n plans to choose from

Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

Query Processing (Cont.)
• Alternative ways of evaluating a given query
  – Equivalent expressions
  – Different algorithms for each operation
• Cost difference between a good and a bad way of evaluating a query can be enormous
• Need to estimate the cost of operations
  – Depends critically on statistical information about relations which the database must maintain
  – Need to estimate statistics for intermediate results to compute cost of complex expressions

Data Definition Language (DDL)
• Specification notation for defining the database schema
  E.g.
      create table account (
          account-number char(10),
          balance integer)
• DDL compiler generates a set of tables stored in a data dictionary
• Data dictionary contains metadata (i.e., data about data)
  – Database schema
• Data storage and definition language
  – language in which the storage structure and access methods used by the database system are specified
  – Usually an extension of the data definition language

Data Manipulation Language (DML)
• Language for accessing and manipulating the data organized by the appropriate data model
  – DML also known as query language
• Two classes of languages
  – Procedural: user specifies what data is required and how to get those data
  – Nonprocedural: user specifies what data is required without specifying how to get those data
• SQL is the most widely used query language

SQL
• SQL: widely used non-procedural (declarative) language
  E.g. find the name of the customer with customer-id 192-83-7465
      select customer.customer-name
      from customer
      where customer.customer-id = '192-83-7465'
  E.g.
find the balances of all accounts held by the customer with customer-id 192-83-7465
      select account.balance
      from depositor, account
      where depositor.customer-id = '192-83-7465'
        and depositor.account-number = account.account-number
• Procedural languages: C++, Java, relational algebra

Data retrieval: Indexing
• How to answer quickly the query “Find the student with SID = 101”?
• One approach is to scan the student table, check every student, and return the one with id=101… very slow for large databases.
• Any better idea?
  – 1st: keep the student records sorted on SID and do a binary search… but updates are costly.
  – 2nd: use a dynamic search tree!! Allow insertions, deletions and updates and at the same time keep the records sorted! In databases we use the B+-tree (multiway search tree).
  – 3rd: use a hash table. Much faster for exact match queries… but cannot support range queries. (Also, special hashing schemes are needed for dynamic data.)

B+Tree Example, B=4
(Figure, with the root at the top. Index keys: 30, 100, 120, 150, 180. Leaf keys: 3 5 11 | 30 35 | 100 101 110 | 120 130 | 150 156 179 | 180 200.)

Database Users
Users are differentiated by the way they expect to interact with the system:
• Application programmers: interact with system through DML calls
• Sophisticated users: form requests in a database query language
• Specialized users: write specialized database applications that do not fit into the traditional data processing framework
• Naïve users: invoke one of the permanent application programs that have been written previously
  – E.g. people accessing database over the web, bank tellers, clerical staff

Database Administrator
Coordinates all the activities of the database system; the database administrator has a good understanding of the enterprise’s information resources and needs.
Database administrator's duties include:
• Schema definition
• Storage structure and access method definition
• Schema and physical organization modification
• Granting user authority to access the database
• Specifying integrity constraints
• Acting as liaison with users
• Monitoring performance and responding to changes in requirements

Database Architecture (data retrieval)
(Diagram components: DB programmer, user, DBA; code with embedded queries, query, DDL commands; DML precompiler, DDL interpreter; query processor with query optimizer and query evaluator; storage manager with file manager and buffer manager; secondary storage holding indices, data, statistics, metadata and schema.)

Data Integrity: Transaction processing
• Why must concurrent access to data be managed?
• John and Jane withdraw $50 and $100 from a common account…

John:                          Jane:
1. get balance                 1. get balance
2. if balance > $50            2. if balance > $100
3. balance = balance - $50     3. balance = balance - $100
4. update balance              4. update balance

• Initial balance $300. Final balance = ? It depends…

Transaction
• An execution of a DB program
• Key concept is the transaction, which is an atomic sequence of database actions (reads/writes).
• ACID properties
  – A: Atomicity
  – C: Consistency
  – I: Isolation
  – D: Durability
• How: log and concurrency control sub-system

Data Integrity: Recovery
Transfer $50 from account A ($100) to account B ($200)
1. Get balance for A
2. If balanceA > $50
3. balanceA = balanceA - 50
4. Update balanceA in database
   System crashes….
5. Get balance for B
6. balanceB = balanceB + 50
7. Update balanceB in database
Recovery management

Transaction Management
What if the system fails? What if more than one user is concurrently updating the same data?
A transaction is a collection of operations that performs a single logical function in a database application.
The transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.
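The transfer scenario above can be sketched with Python's standard sqlite3 module. This is a minimal illustration under stated assumptions: the account table, the transfer and balances helpers, and the crash_midway flag are hypothetical names, the "crash" is simulated by an exception rather than a real power failure, and the balance check from step 2 is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 200)])
conn.commit()


def transfer(conn, src, dst, amount, crash_midway=False):
    """Move amount from src to dst as one atomic unit (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if crash_midway:
                # Stand-in for a failure between the debit and the credit.
                raise RuntimeError("simulated crash after debiting the source")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except RuntimeError:
        pass  # the partial debit has already been rolled back


def balances(conn):
    """Return {account name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM account"))


transfer(conn, "A", "B", 50, crash_midway=True)
after_crash = balances(conn)    # neither update is visible

transfer(conn, "A", "B", 50)
after_success = balances(conn)  # both updates are visible

print(after_crash)
print(after_success)
```

After the failed attempt the balances are unchanged, and after the successful one both updates appear together; at no point is the $50 debited from A without being credited to B. Surviving a real crash additionally depends on the log kept by the DBMS, which this in-memory sketch does not exercise.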
The concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.

Database Architecture
(Diagram components: DB programmer, DBA, user; code with embedded queries, DDL commands, query; DML precompiler, DDL interpreter; query processor with query optimizer and query evaluator; storage manager with file manager, transaction manager, recovery manager and buffer manager; secondary storage holding indices, data, metadata, integrity constraints, statistics and schema.)

Client/Server Architecture
• The overall purpose of a database system is to support the development and execution of database applications.
• From a high-level point of view, such a system can be regarded as having a simple two-part structure consisting of a server, also called the back end, and a set of clients, also called the front ends.
• The server is the DBMS itself. It supports all basic DBMS functions, such as:
  – Data definition
  – Data manipulation
  – Data security and integrity, etc.

Client/Server Architecture
• The clients are the various applications that run on top of the DBMS, both user-written applications and built-in applications.
• User-written applications: regular application programs written either in a conventional 3GL like C++ or COBOL, or in some proprietary 4GL.
• Vendor-provided applications (tools): applications whose basic purpose is to assist in the creation and execution of other applications.
  – Ex: query language processors, report writers, business graphics subsystems, spreadsheets, statistical packages, natural language processors, data extract tools, application generators, other application tools including computer-aided software engineering (CASE) products, data mining and visualization tools.
Database Applications
• Banking: all transactions
• Airlines: reservations, schedules
• Universities: registration, grades
• Sales: customers, products, purchases
• Manufacturing: production, inventory, orders, supply chain
• Human resources: employee records, salaries, tax deductions
Databases touch all aspects of our lives