Multimedia Information Systems CS 4570 Outlines • Introduction to DMBS • Relational database and SQL • B+- tree index structure Database Management System (DBMS) • Collection of interrelated data • Set of programs to access the data • DBMS contains information about a particular enterprise • DBMS provides an environment that it both convenient and efficient to use Purpose of Database Systems • To overcome the disadvantages of the typical file–processing systems: – – – – – – – Data redundancy and inconsistency Difficulty in accessing data Data isolation multiple files and formats Integrity problems Atomicity of updates Concurrent access by multiple users Security problems Data and Data Models • Data is really the “given facts” • A data model is a collection of tools for describing: – – – – Data Data relationships Data semantics Data constraints • Object-based logical models: – object-oriented model, semantic model… • Record-based logical models: – relational model (e.g. SQL, DB2)… Entity-Relationship (ER) Model • A database can be modeled as: – a collection of entities, – relationships among entities Relational Model Relational Databases • A relational database is a database that is perceived be its users as a collection of tables. ( and nothing but tables ). Relation Instance • The current values (relation instance) of a relation are specified by a table. • An element t of r is a tuple; represented by a row in a table. tuple attribute Structured Query Language (SQL) • SQL is based on set and relational operations with certain modifications and enhancements • A typical SQL query has the form: select A 1 , A 2 , ..., A n from r 1 , r 2 , ..., r m where P – Ais represent attributes – ris represent relations – P is a predicate. • The result of an SQL query is a relation. Banking Example branch(branch-name, branch-city, assets) customer(customer-name, customer-street, customer-city) account(branch-name, account-number, balance) loan(branch-name, loan-number, amount) depositor(customer-name, account-number) borrower(customer-name, loan-number) The Select Clause • The select clause corresponds to the projection operation of the relational algebra. It is used to list the attributes desired in the result of a query. • Find the names of all branches in the loan relation select branch-name from loan • An asterisk in the select clause denotes”all attributes” branch-name loan-number amount select * from loan The Select Clause(Cont.) • SQL allows duplicates in relations as well as in query results. • To force the elimination of duplicates, insert the keyword distinct after select. Find the names of all branches in the loan relation, and remove duplicates select distinct branch-name from loan • The keyword all specifies that duplicates not be removed. Select all branch-name from loan The Select Clause(Cont.) • The select clause can contain arithmetic expressions involving the operators,+,-,*,/,and operating on constants or attributes of tuples. • The query: select branch-name,loan-number,amount * 100 from loan would return a relation which is the same as the loan relation, except that the attribute amount is multiplied by 100 The from Clause • The from clause corresponds to the Cartesian product operation of the relational algebra. It lists the relations to be scanned in the evaluation of the expression. • Find the Cartesian product borrower loan select * from borrower, loan • Find the name and loan number of all customers having a borrower (customer-name, loan-number) loan at the Perryridge branch. loan (branch-name, loan-number, amount) select distinct customer-name,borrower.loan-number borrower × loan from borrower,loan where borrower.loan-number=loan.loan-number ( borrower.customer-name, borrower.loan-number, and loan.branch-name, loan.loan-number, loan.amount) branch-name=“Perryridge” The where Clause • The where clause corresponds to the selection predicate of the relational algebra. It consists of a predicate involving attributes of the relations that appear in the from clause. • Find all loan numbers for loans made at the Perryridge branch with loan amounts greater than $1200. select loan-number from loan where branch-name=“Perryridge” and amount>1200 • SQL uses the logical connectives and, or, and not. It allows the use of arithmetic expressions as operands to the comparison operators. The where Clause(Cont.) • SQL includes a between comparison operator in order to simplify where clauses that specify that a value be less than or equal to some value and greater than or equal to some other value. • Find the loan number of those loans with loan amounts between $90,000 and $100,000(that is, $90,000 and $100,000) select loan-number from loan where amount between 90000 and 100000 Example Query • Find all branches that have greater assets than some branch located in Brooklyn. select distinct T.branch-name from branch as T, branch as S where T.assets > S.assets and S.branch-city =“Brooklyn” B+ -Tree Index Files B+ -Tree indices are an alternative to indexed-sequential files. • Disadvantage of indexed-sequential files: performance degrades as file grows, both for index lookups and for sequential scans through the data. • Advantage of B+ -Tree index files: automatically reorganizes itself with small, local, changes, in the face of insertions and deletions. • Disadvantage of B+ -Trees: extra insertion and deletion overhead, space overhead. • Advantages of B+ -Trees outweigh disadvantages, and they are used extensively. + B -Tree Index Files (Cont.) A B+ -Tree is a rooted tree satisfying the following properties: •All paths from root to leaf are of the same length •Each node that is not a root or a leaf has between n/ 2 and n children •A leaf node has between (n-1)/2 and n-1 values •Special cases:if the root is not a leaf it has at least 2 children; if the root is a leaf (that is there are no other nodes in the tree) it can have between 0 and (n-1) values. B+ -Tree Node Structure •Typical node –Ki are the search-key values –Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes). •The search-keys in a node are ordered K1 < K2 < K3 < ... Leaf Nodes in B+ -Trees Properties of a leaf node: • For i = 1, 2, . . . , n-1, i either points to a file record with search-key value Ki , or to a bucket of pointers to file records, each record having search-key value Ki . Only need bucket structure if search-key does not form a primary key. • If Li , Lj are leaf nodes and i < j , Li ‘s search-key values are less than Lj ’s search-key values • Pn points to next leaf node in search-key order Non-Leaf Nodes in B+ -Trees • The nonleaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with n pointers: –All the search-keys in the subtree to which P1 points are less than K1 –For 2 i n-1 all the search-key in the subtree to which Pi points have values greater than or equal to Ki-1 and less than Ki –All the search-keys in the subtree to which Pn points are greater than or equal to Kn-1 Example of a + B -tree Queries on B+-Trees (Cont.) • In processing a query, a path is traversed in the tree from the root to some leaf node. • If there are K search-key values in the file, the path is no longer than log n/ 2 (K ). • A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry). • With 1 million search key values and n = 100, at most log50 (1, 000, 000) = 4 nodes are accessed in a lookup. • Contrast this with a balanced binary tree with 1 million search key values --- around 20 nodes are accessed in a lookup –above difference is significant since every node access may need a disk I/O, costing around 30 millisecond! Updates on B+-Trees: Insertion (Cont.) • Splitting a node: –take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first n/ 2 in the original node, and the rest in a new node. –let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. If the parent is full, split it and propagate the split further up. • The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the root node may be split increasing the height of the tree by 1. Updates on B+-Trees: Insertion (Cont.) Self-Practice + •Construct a B -tree for the following set of key values: (2, 3, 5, 7, 11, 17, 19, 23, 29, 31) with n = 6