TDDD63: Database systems
Jose M. Peña
jose.m.pena@liu.se

Database applications
- Traditional applications: numeric and textual databases.
- More recent applications: bioinformatics, multimedia databases, geographic information systems (GIS), data warehouses, real-time and active databases, …

What is a database?
- A database represents some aspect of the real world, i.e. a mini-world.
- A database consists of a logically coherent collection of data.
- A database is built with some purpose in mind.

Basic definitions
- Database: a collection of related data.
- Data: known facts that can be recorded and have an implicit meaning.
- Mini-world: some part of the real world about which data is stored in a database.
- Database management system (DBMS): a software package/system to facilitate the creation and maintenance of a computerized database.
- Database system: the DBMS software together with the data itself.

Database environment (figure)

Example of a database (figure)

Main characteristics
- Self-describing nature of a database system: a DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints). This allows the DBMS software to work with different database applications.
- Insulation between programs and data: allows changing data structures and storage organization without having to change the DBMS access programs.
- Data abstraction: programs refer to the abstract model rather than data storage details.
- Support of multiple views of the data: each user may see a different view of the database, which describes only the data of interest to that user.

Database design
- Two main activities: database design and applications design.
- This lecture focuses on database design.
- Applications design focuses on the programs and interfaces that access the database. It is generally considered part of software engineering.

Entity-relationship (ER) model
- High-level data model.
- An overview of the database.
- Easy to discuss with non-database experts.
- Easy to translate to the data model of the DBMS.

Example of ER modelling
A taxi company needs to model its activities. There are two types of employees in the company: drivers and operators. For drivers, the company wants to record the date of issue and type of the driving license, and the date of issue of the taxi driver's certificate. For all employees, it wants to record their personal number, address and available phone numbers. The company owns a number of cars. For each car, it needs to know the type, year of manufacture, number of places in the car and date of the last service. The company wants to keep a record of car trips. A taxi may be picked up on the street or ordered through an operator, who assigns the order to a certain driver and car. Departure and destination addresses, together with times, should also be recorded.

ER diagram (figure). Content recoverable from the diagram:
- Employee: PN, Address (Street, PostNumber, Town), Phone.
- Driver and Operator: subtypes of Employee (marked "o" in the diagram). Driver has DrivingLicenseDate, DrivingLicenseType and TaxiCertifDate.
- Trip: ID, DepTime, DestTime, DeparturePlace, Destination.
- Car: RegNumber, Type, YearOfManuf, Places, ServiceDate.
- Relationships: Driver drives Trip (1:N), Operator assign Trip (1:N), Trip made_by Car (N:1).

Database design process
ER model (the diagram above) -> relational model.

Relational model
[Figure: the relation EMPLOYEE with its attributes and tuples marked; one attribute domain shown is a string shorter than 30 chars.]
Example attribute domains in EMPLOYEE: integer x with 400 < x < 8000; character M or F; date as yyyy-mm-dd.

EMPLOYEE
FNAME   M     LNAME    SSN        BDATE       ADDRESS  S  SALARY  SUPERSSN   DNO
Ramesh  K     Narayan  666884444  1962-09-15  …        M  38000   888665555  5
Joyce   A     English  453453453  1972-07-31  …        F  25000   888665555  5
Ahmad   V     Jabbar   987987987  1969-03-29  …        M  25000   888665555  4
James   E     Borg     888665555  1937-11-10  …        M  55000   null       1

- Relation: set of tuples, i.e. no duplicates are allowed.
- Database: collection of relations.

Relational model: NULL value
EMPLOYEE
FNAME   M     LNAME    SSN        BDATE       ADDRESS  S  SALARY  SUPERSSN   DNO
Ramesh  K     Narayan  666884444  1962-09-15  …        M  38000   888665555  5
Joyce   Null  English  453453453  1972-07-31  …        F  38000   888665555  5
Ahmad   V     Jabbar   987987987  1969-03-29  …        M  25000   888665555  4
James   Null  Borg     888665555  1937-11-10  …        M  55000   Null       1

Relational model: Foreign keys
EMPLOYEE: as in the tables above.
DEPARTMENT
DNAME           DNUMBER  MGRSSN     MGRSTARTDATE
Research        5        666884444  1988-05-22
Administration  4        987987987  1995-01-01
Headquarters    1        888665555  1981-06-19

- Entity integrity constraint: a primary key value (e.g. SSN in EMPLOYEE) may not be NULL.
- Referential integrity constraint: a foreign key value (e.g. DNO in EMPLOYEE, MGRSSN in DEPARTMENT) must either be NULL or match a primary key value in the referenced relation (DNUMBER in DEPARTMENT, SSN in EMPLOYEE).

Translation ER to relational model
Done by running an algorithm. Result for the taxi example:
EMPLOYEE(PN, Street, PostNumber, Town)
DRIVER(PN, DrivingLicenseDate, DrivingLicenseType)
OPERATOR(PN)
TRIP(ID, Driver, Operator, Car, DepTime, …)
CAR(RegNumber, …)
PHONE(PN, Number)

SQL
Relational data model   SQL
relation                table
attribute               column
tuple                   row

SQL is used by the DBMS to manipulate relational models.
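As a small illustration of the relation/table correspondence and of the two integrity constraints on the EMPLOYEE and DEPARTMENT relations, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (so that it runs without a server), uses only a subset of the columns, and introduces a hypothetical helper `try_insert` that is not part of the slides.

```python
import sqlite3

# In-memory SQLite database; SQLite stands in for MySQL here (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# relation -> table, attribute -> column (column subset chosen for brevity)
conn.execute("""
    CREATE TABLE DEPARTMENT (
        DNUMBER INTEGER PRIMARY KEY,
        DNAME   TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        SSN   TEXT NOT NULL PRIMARY KEY,               -- entity integrity
        FNAME TEXT,
        LNAME TEXT,
        DNO   INTEGER REFERENCES DEPARTMENT(DNUMBER)   -- referential integrity
    )""")

# tuple -> row
conn.execute("INSERT INTO DEPARTMENT VALUES (5, 'Research')")
conn.execute("INSERT INTO EMPLOYEE VALUES ('666884444', 'Ramesh', 'Narayan', 5)")


def try_insert(row):
    """Hypothetical helper: True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


null_key_ok = try_insert((None, "Joyce", "English", 5))             # NULL primary key
dangling_fk_ok = try_insert(("453453453", "Joyce", "English", 9))   # no department 9
valid_ok = try_insert(("453453453", "Joyce", "English", 5))         # satisfies both constraints
print(null_key_ok, dangling_fk_ok, valid_ok)
```

Only the last insert is accepted. The explicit NOT NULL on SSN is there because SQLite, unlike MySQL, does not by itself forbid NULL in a non-integer primary key column.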
SQL is declarative (what data to get, not how). Statement categories:
- Definition: CREATE, ALTER, DROP.
- Query: SELECT.
- Update: INSERT, DELETE, UPDATE.
- Advanced: procedures, functions, flow control, triggers, exception handling, …

Database design process (figure)

Hands-on: MySQL
- One of the most widely used DBMSs.
- Broad subset of ANSI SQL 99.
- Open source. Check mysql.com
- Swedish founders. Now owned by Oracle.

Storage hierarchy
- Primary storage: CPU cache memory, main memory (fast, small, expensive, volatile, accessible by CPU).
- Secondary storage: disk, tape (slow, big, cheap, permanent, inaccessible by CPU).
- Databases are kept in secondary storage.

Disk
- Formatting divides the hard-coded sectors into equal-sized blocks.
- A block is the unit of transfer of data between disk and main memory, e.g.
  - Read = copy a block from disk to a buffer in main memory.
  - Write = the opposite way.
- R/w time = seek time + rotational delay + block transfer time.
- So, read/write to disk is a bottleneck:
  - Disk access: about 10^-3 sec.
  - Main memory access: about 10^-8 sec.
  - CPU instruction: about 10^-9 sec.

Files and records
- Data is stored in files.
- A file is a sequence of records.
- A record is a set of field values.
- For instance, file = relation, record = entity, and field = attribute.

Heap files
- Records are added to the end of the file. Then,
  - Cheap record addition.
  - Expensive record retrieval, removal and update, since they imply linear search.
  - Moreover, record removal implies waste of space. So, periodic reorganization.

Sorted files
- Records are ordered according to some field. Then,
  - Cheap record retrieval by performing binary search (on the ordering field, otherwise expensive).
  - Expensive record addition, but less expensive record deletion (deletion markers + periodic reorganization).

Hash files
- The hash function (e.g. position = field mod r) returns a bucket number, where a bucket is one or several contiguous disk blocks.
- Collisions are typically resolved via an overflow area.
- Cheapest random record retrieval (when searching for equality).
- Expensive ordered record retrieval.
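The three file organizations above can be compared with a minimal in-memory sketch. Everything here is an assumption made for illustration: the records are hypothetical (key, payload) pairs, Python lists stand in for disk blocks, and each bucket is a plain list rather than blocks with an overflow area.

```python
import bisect

# Hypothetical records (key, payload); the values are illustrative only.
records = [(k, f"rec{k}") for k in (42, 7, 19, 88, 3, 61)]

# Heap file: records kept in arrival order, so retrieval is a linear search.
heap_file = list(records)


def heap_lookup(key):
    """Return (payload, number of records inspected)."""
    probes = 0
    for k, payload in heap_file:
        probes += 1
        if k == key:
            return payload, probes
    return None, probes


# Sorted file: records ordered on the key, so retrieval is a binary search.
sorted_file = sorted(records)
sorted_keys = [k for k, _ in sorted_file]


def sorted_lookup(key):
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_file[i][1]
    return None


# Hash file: position = key mod r selects the bucket to search.
R = 4  # number of buckets (assumed)
buckets = [[] for _ in range(R)]
for k, payload in records:
    buckets[k % R].append((k, payload))


def hash_lookup(key):
    for k, payload in buckets[key % R]:
        if k == key:
            return payload
    return None


print(heap_lookup(61))    # the last record added: every record is inspected
print(sorted_lookup(19))
print(hash_lookup(88))
```

Appending to `heap_file` is a single operation, whereas keeping `sorted_file` ordered means shifting records on every insertion, which mirrors the cheap versus expensive record addition noted above.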
Indexes
- File organization (heap, sorted, hash) determines the primary access method.
- However, this may not be fast enough.
- Solution: indexes, or secondary access methods.
  - Primary index.
  - Secondary and clustering indexes.
  - Multilevel indexes.
- Mind the index maintenance cost.

Primary index
- Assumption: ordered data file. Primary access method: binary search.
- Always: ordered index file. Primary access method: binary search.
- Faster to access a random record via a binary search in the index than in the data file.

Transactions
- A transaction is a logical unit of database processing and consists of one or several operations.
- Simplified operations in a transaction: Read-item(X), Write-item(X).
- Example:

T1                                       T2
Read-item(my-account)                    Read-item(my-account)
my-account := my-account - 2000          my-account := my-account + 1000
Write-item(my-account)                   Write-item(my-account)
Read-item(other-account)
other-account := other-account + 2000
Write-item(other-account)

ACID properties for transactions
- Atomicity: a transaction is an atomic unit, i.e. it is either executed completely or not at all.
- Consistency: a database that is in a consistent state before the execution of a transaction is also in a consistent state after the execution of the transaction.
- Isolation: a transaction should act as if it is executed in isolation from the other transactions.
- Durability: changes in the database made by a committed transaction are permanent.

Who ensures that the ACID properties are satisfied?
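To see why that question matters, the T1/T2 example can be run under two schedules. This is a minimal sketch under stated assumptions: the function `run`, the tuple encoding of operations and the starting balances are all hypothetical, and a Python dict plays the role of the database.

```python
def run(schedule, db):
    """Execute (transaction, action, item[, amount]) steps against db.

    Each transaction works on a private workspace, as with Read-item/Write-item.
    """
    workspace = {}
    for txn, action, item, *amount in schedule:
        ws = workspace.setdefault(txn, {})
        if action == "read":       # Read-item(X): copy X from the database
            ws[item] = db[item]
        elif action == "add":      # X := X + amount, in the workspace only
            ws[item] += amount[0]
        elif action == "write":    # Write-item(X): copy X back to the database
            db[item] = ws[item]
    return db


T1 = [("T1", "read", "my-account"),
      ("T1", "add", "my-account", -2000),
      ("T1", "write", "my-account"),
      ("T1", "read", "other-account"),
      ("T1", "add", "other-account", 2000),
      ("T1", "write", "other-account")]
T2 = [("T2", "read", "my-account"),
      ("T2", "add", "my-account", 1000),
      ("T2", "write", "my-account")]

start = {"my-account": 5000, "other-account": 0}  # hypothetical balances

# Serial schedule: all of T1, then all of T2.
serial = run(T1 + T2, dict(start))

# Interleaved schedule: T2 reads my-account before T1 writes it and writes after T1.
interleaved = run(T1[:2] + T2[:2] + T1[2:3] + T2[2:] + T1[3:], dict(start))

print(serial)
print(interleaved)
```

In the interleaved schedule T2 overwrites T1's withdrawal, so the 2000 appears in other-account without having left my-account. The outcome matches neither serial order, which is a violation of isolation.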
The ACID properties are ensured by:
- Atomicity: recovery system.
- Consistency preservation: programmer + DBMS.
- Isolation: concurrency control.
- Durability: recovery system.

Isolation: Concurrency control
- We are assuming multiple users but just one CPU.
- A transaction follows the two-phase locking (2PL) protocol if all Read-lock() and Write-lock() operations come before the first Unlock() operation in the transaction.
- Example (transactions T1 and T2, shown side by side on the slide). One sequence follows 2PL, the other does not:

Follows 2PL                            Does not follow 2PL
Read-lock(my-account1)                 Read-lock(my-account1)
Read-item(my-account1)                 Read-item(my-account1)
Write-lock(my-account2)                Unlock(my-account1)
Unlock(my-account1)                    Write-lock(my-account2)
Read-item(my-account2)                 Read-item(my-account2)
my-account2 := my-account2 + 2000      my-account2 := my-account2 + 2000
Write-item(my-account2)                Write-item(my-account2)
Unlock(my-account2)                    Unlock(my-account2)

Atomicity and durability: Recovery system
- Assumption: deferred update (NO-UNDO/REDO).
- System log:

start-transaction T1
write-item T1, D, 10, 20
commit T1
checkpoint
start-transaction T4
write-item T4, B, 10, 20
write-item T4, A, 5, 10
commit T4
start-transaction T2
write-item T2, B, 20, 15
start-transaction T3
write-item T3, A, 10, 30
write-item T2, D, 20, 25
CRASH

[Figure: timeline of T1, T4, T2 and T3 against the checkpoint and the crash.]
- With deferred update, recovery redoes the writes of the transactions that committed after the last checkpoint (here T4) and ignores the uncommitted ones (T2 and T3). Nothing needs to be undone.

Data mining
- Knowledge discovery from data.
- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from data.
- Not everything is data mining, e.g. simple search and query processing.

Data mining (figure)
Databases -> data cleaning and data integration -> data warehouse -> selection -> task-relevant data -> data mining -> pattern evaluation.

Data mining (figure)
Input data -> data pre-processing -> data mining -> post-processing.
- Data pre-processing: data integration, normalization, feature selection, dimension reduction.
- Data mining: pattern discovery, association analysis, classification, clustering, outlier analysis, …
- Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization.

Data mining
Association analysis:
- Find items that are frequently purchased together. E.g., 50 % of my customers buy beer and diapers.
- Build accurate association rules from those sets of items. E.g., Diaper -> Beer [support = 0.5, confidence = 0.75].
- Support = p(Diaper, Beer). Confidence = p(Beer | Diaper).
- Association = correlation != causality. But we can pretend they are the same. Then,
  - Increase sales of beer by giving a discount for diapers.
  - Prevent drops in sales of beer by having diapers in stock.

Data mining
Classification (supervised learning):
- Construct models based on some training examples.
- Use the models to predict not-seen-before examples.
- Applications: credit card fraud detection, medical diagnosis, …
- Knowledge = model.

Clustering (unsupervised learning):
- Group data into homogeneous groups.
- Applications: cluster customers to tailor your products to the different groups, …
- Knowledge = clusters.

Hands-on: MySQL + Weka
- Weka: a collection of data mining techniques, such as data pre-processing, classification, regression, clustering, association rules, and visualization.
- Written in Java.
- Connected with MySQL via JDBC driver.
- Open source. Check http://www.cs.waikato.ac.nz/~ml/weka/

Summary
- Focus on relational databases: ER diagram. Relational model. Physical model. Concurrency control and recovery systems. MySQL.
- Non-relational databases: NoSQL.
- Data mining: Overview. Association analysis. Weka.
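As a closing illustration of the association analysis part, support and confidence can be computed directly from a list of purchases. This is a minimal sketch, not how Weka does it: the baskets are hypothetical and were chosen so that the rule Diaper -> Beer comes out with support 0.5 and confidence 0.75, and the function names are illustrative.

```python
# Hypothetical market baskets, chosen to reproduce the rule
# Diaper -> Beer [support = 0.5, confidence = 0.75].
baskets = [
    {"diaper", "beer"},
    {"diaper", "beer", "milk"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"beer", "bread"},
    {"milk"},
]


def count(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def support(items):
    """Estimate of p(items): fraction of baskets containing all of them."""
    return count(items) / len(baskets)


def confidence(antecedent, consequent):
    """Estimate of p(consequent | antecedent)."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"diaper", "beer"})
rule_confidence = confidence({"diaper"}, {"beer"})
print(rule_support, rule_confidence)
```

Three of the six baskets contain both items, and three of the four baskets with diapers also contain beer, which gives the two numbers in the rule.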