TDDD63: Database systems

Jose M. Peña
jose.m.pena@liu.se

Database applications

Traditional applications:
• Numeric and textual databases.
More recent applications:
• Bioinformatics.
• Multimedia databases.
• Geographic information systems (GIS).
• Data warehouses.
• Real-time and active databases.
• …

What is a database?

• A database represents some aspect of the real world, i.e. a mini-world.
• A database consists of a logically coherent collection of data.
• A database is built with some purpose in mind.

Basic definitions

• Database: A collection of related data.
• Data: Known facts that can be recorded and have an implicit meaning.
• Mini-world: Some part of the real world about which data is stored in a database.
• Database management system (DBMS): A software package/system to facilitate the creation and maintenance of a computerized database.
• Database system: The DBMS software together with the data itself.

[Figure: Database environment]
[Figure: Example of a database]
Main characteristics

Self-describing nature of a database system:
• A DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints).
• This allows the DBMS software to work with different database applications.
Insulation between programs and data:
• Allows changing data structures and storage organization without having to change the DBMS access programs.
Data abstraction:
• Programs refer to the abstract model rather than data storage details.
Support of multiple views of the data:
• Each user may see a different view of the database, which describes only the data of interest to that user.

Database design

Two main activities:
• Database design.
• Applications design.
Focus in this lecture on database design.
Applications design focuses on the programs and interfaces that access the database. Generally considered part of software engineering.
Entity-relationship (ER) model

• High-level data model.
• An overview of the database.
• Easy to discuss with non-database experts.
• Easy to translate to data model of DBMS.

Example of ER modelling
A taxi company needs to model its activities.
There are two types of employees in the company: drivers and operators. For drivers, it is of interest to know the date of issue and type of the driving license, and the date of issue of the taxi driver's certificate. For all employees, it is of interest to know their personal number, address and available phone numbers.
The company owns a number of cars. For each car, there is a need to know its type, year of manufacture, number of places (seats) and date of the last service.
The company wants to keep a record of car trips. A taxi may be hailed on the street or ordered through an operator, who assigns the order to a certain driver and car. Departure and destination addresses, together with times, should also be recorded.
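ER diagram.

[Figure: ER diagram of the taxi company. Content recoverable from the figure:]
Entities and attributes:
• Employee: PN, Address (Street, PostNumber, Town), Phone.
• Driver (subtype of Employee, marked "o" in the figure): DrivingLicenseDate, DrivingLicenseType, TaxiCertifDate.
• Operator (subtype of Employee, marked "o" in the figure).
• Trip: ID, DeparturePlace, DepTime, Destination, DestTime.
• Car: RegNumber, Type, YearOfManuf, Places, ServiceDate.
Relationships:
• drives: Driver (1) to Trip (N).
• assign: Operator (1) to Trip (N).
• made_by: Trip (N) to Car (1).

[Figure: Database design process, from ER model to relational model]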
Relational model

EMPLOYEE
FNAME   M  LNAME    SSN        BDATE       ADDRESS  S  SALARY  SUPERSSN   DNO
Ramesh  K  Narayan  666884444  1962-09-15  …        M  38000   888665555  5
Joyce   A  English  453453453  1972-07-31  …        F  25000   888665555  5
Ahmad   V  Jabbar   987987987  1969-03-29  …        M  25000   888665555  4
James   E  Borg     888665555  1937-11-10  …        M  55000   null       1

• Attributes: the columns of the table.
• Tuples: the rows of the table.
• Domain: the allowed values of an attribute, e.g. string shorter than 30 chars; integer with 400 < x < 8000; character, M or F; date of the form yyyy-mm-dd.

Relational model

• Relation: Set of tuples, i.e. no duplicates are allowed.
• Database: Collection of relations.
• NULL value:

EMPLOYEE
FNAME   M     LNAME    SSN        BDATE       ADDRESS  S  SALARY  SUPERSSN   DNO
Ramesh  K     Narayan  666884444  1962-09-15  …        M  38000   888665555  5
Joyce   Null  English  453453453  1972-07-31  …        F  38000   888665555  5
Ahmad   V     Jabbar   987987987  1969-03-29  …        M  25000   888665555  4
James   Null  Borg     888665555  1937-11-10  …        M  55000   Null       1
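The set semantics of a relation can be illustrated with a small Python sketch. It is only an analogy, not how a DBMS stores data: tuples are Python tuples, the relation is a Python set, and None stands in for NULL. The attribute subset and the helper name in_domain_sex are choices made for this illustration.

```python
# A relation modelled as a set of tuples (illustration only).
# Attribute order: FNAME, M, LNAME, SSN, S, SALARY, SUPERSSN, DNO
employee = {
    ("Ramesh", "K", "Narayan", 666884444, "M", 38000, 888665555, 5),
    ("Joyce", "A", "English", 453453453, "F", 25000, 888665555, 5),
    ("James", "E", "Borg", 888665555, "M", 55000, None, 1),  # None plays the role of NULL
}

size_before = len(employee)
# Adding a tuple that is already present leaves the relation unchanged:
# a set cannot contain duplicates.
employee.add(("Joyce", "A", "English", 453453453, "F", 25000, 888665555, 5))
size_after = len(employee)


def in_domain_sex(value):
    """Domain check for the S attribute: a single character, M or F."""
    return value in ("M", "F")


all_in_domain = all(in_domain_sex(row[4]) for row in employee)
```

After the duplicate insertion the relation still holds three tuples, and every S value passes the domain check.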
Relational model

Entity integrity constraint

EMPLOYEE
FNAME   M     LNAME    SSN        BDATE       ADDRESS  S  SALARY  SUPERSSN   DNO
Ramesh  K     Narayan  666884444  1962-09-15  …        M  38000   888665555  5
Joyce   Null  English  453453453  1972-07-31  …        F  38000   888665555  5
Ahmad   V     Jabbar   987987987  1969-03-29  …        M  25000   888665555  4
James   Null  Borg     888665555  1937-11-10  …        M  55000   Null       1

Relational model

Foreign keys
Referential integrity constraint

EMPLOYEE
FNAME   M  LNAME    SSN        BDATE       ADDRESS  S  SALARY  SUPERSSN   DNO
Ramesh  K  Narayan  666884444  1962-09-15  …        M  38000   888665555  5
Joyce   A  English  453453453  1972-07-31  …        F  25000   888665555  5
Ahmad   V  Jabbar   987987987  1969-03-29  …        M  25000   888665555  4
James   E  Borg     888665555  1937-11-10  …        M  55000   Null       1

DEPARTMENT
DNAME           DNUMBER  MGRSSN     MGRSTARTDATE
Research        5        666884444  1988-05-22
Administration  4        987987987  1995-01-01
Headquarters    1        888665555  1981-06-19
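A minimal sketch of how a DBMS may enforce the two constraints. It uses Python's built-in sqlite3 module as a stand-in for the DBMS and only a few of the attributes shown above. It assumes that SSN is the primary key of EMPLOYEE and that DNO refers to DNUMBER in DEPARTMENT, which the shared values 5, 4 and 1 suggest. The helper name rejected is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("""
    CREATE TABLE DEPARTMENT (
        DNAME   TEXT,
        DNUMBER INTEGER NOT NULL PRIMARY KEY
    )""")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        FNAME TEXT,
        LNAME TEXT,
        SSN   TEXT NOT NULL PRIMARY KEY,
        DNO   INTEGER REFERENCES DEPARTMENT(DNUMBER)
    )""")
conn.execute("INSERT INTO DEPARTMENT VALUES ('Research', 5)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('Ramesh', 'Narayan', '666884444', 5)")


def rejected(sql):
    """Return True if the statement is refused with an integrity error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Entity integrity: the primary key may not be NULL.
null_key_rejected = rejected(
    "INSERT INTO EMPLOYEE VALUES ('Joyce', 'English', NULL, 5)")
# Referential integrity: DNO must match an existing DNUMBER (no department 7 exists).
dangling_fk_rejected = rejected(
    "INSERT INTO EMPLOYEE VALUES ('Ahmad', 'Jabbar', '987987987', 7)")
# A NULL in a non-key attribute such as FNAME is accepted.
null_name_rejected = rejected(
    "INSERT INTO EMPLOYEE VALUES (NULL, 'Borg', '888665555', 5)")
```

Both invalid insertions are refused, while the NULL in a non-key attribute is accepted.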
Translation ER to relational model

Done by running an algorithm.
EMPLOYEE(PN, Street, PostNumber, Town)
DRIVER(PN, DrivingLicenseDate, DrivingLicenseType)
OPERATOR(PN)
TRIP(ID, Driver, Operator, Car, DepTime, …)
CAR(RegNumber, …)
PHONE(PN,Number)
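The relations above could be declared roughly as follows. This is a sketch using Python's sqlite3 module rather than MySQL. The choice of keys (PN as primary key, PN plus Number as the key of PHONE), the column types, and all sample values are assumptions made for the illustration. TRIP and CAR are left out because their attribute lists are elided above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (
        PN TEXT NOT NULL PRIMARY KEY, Street TEXT, PostNumber TEXT, Town TEXT);
    CREATE TABLE DRIVER (
        PN TEXT NOT NULL PRIMARY KEY REFERENCES EMPLOYEE(PN),
        DrivingLicenseDate TEXT, DrivingLicenseType TEXT);
    CREATE TABLE OPERATOR (
        PN TEXT NOT NULL PRIMARY KEY REFERENCES EMPLOYEE(PN));
    CREATE TABLE PHONE (
        PN TEXT NOT NULL REFERENCES EMPLOYEE(PN),
        Number TEXT NOT NULL,
        PRIMARY KEY (PN, Number));
""")

# Made-up sample data: one employee who is a driver and has two phone numbers.
conn.execute("INSERT INTO EMPLOYEE VALUES ('E1', 'Street 1', '00000', 'Town')")
conn.execute("INSERT INTO DRIVER VALUES ('E1', '2000-01-01', 'B')")
conn.executemany("INSERT INTO PHONE VALUES (?, ?)", [("E1", "111"), ("E1", "222")])

phones = [row[0] for row in conn.execute(
    "SELECT Number FROM PHONE WHERE PN = 'E1' ORDER BY Number")]
is_driver = conn.execute(
    "SELECT COUNT(*) FROM DRIVER WHERE PN = 'E1'").fetchone()[0] == 1
```

The multi-valued attribute Phone becomes its own relation, so one employee can have any number of phone numbers, and each subtype (DRIVER, OPERATOR) shares its key PN with EMPLOYEE.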
SQL

Terminology:
relational data model   SQL
relation                table
attribute               column
tuple                   row

• Used by the DBMS to manipulate relational models.
• Declarative (what data to get, not how).
• Definition: CREATE, ALTER, DROP
• Query: SELECT
• Update: INSERT, DELETE, UPDATE
• Advanced: Procedures, functions, flow control, triggers, exception handling, …
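A short sketch of the statement categories listed above, run through Python's sqlite3 module as a stand-in for MySQL. The table uses a subset of the EMPLOYEE attributes and values from the earlier example. The column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Definition: CREATE
conn.execute(
    "CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LNAME TEXT, SALARY INTEGER, DNO INTEGER)")

# Update: INSERT
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)", [
    ("666884444", "Narayan", 38000, 5),
    ("453453453", "English", 25000, 5),
    ("987987987", "Jabbar", 25000, 4),
    ("888665555", "Borg", 55000, 1),
])

# Query: SELECT states which rows are wanted, not how to find them.
dept5 = conn.execute(
    "SELECT LNAME FROM EMPLOYEE WHERE DNO = 5 ORDER BY LNAME").fetchall()

# Update: UPDATE and DELETE
conn.execute("UPDATE EMPLOYEE SET SALARY = SALARY + 1000 WHERE SSN = '453453453'")
conn.execute("DELETE FROM EMPLOYEE WHERE DNO = 4")

remaining = conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]
new_salary = conn.execute(
    "SELECT SALARY FROM EMPLOYEE WHERE SSN = '453453453'").fetchone()[0]
```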

Database design process
Hands-on: MySQL

• One of the most widely used DBMSs.
• Broad subset of ANSI SQL 99.
• Open source. Check mysql.com
• Swedish founders. Now owned by Oracle.
Storage hierarchy

CPU
Primary storage (fast, small, expensive, volatile, accessible by CPU):
• Cache memory
• Main memory
Secondary storage (slow, big, cheap, permanent, not directly accessible by CPU):
• Disk
• Tape
Databases: stored in secondary storage.
Disk

[Figure: disk, with a sector labelled]

• Formatting divides the hard-coded sectors into equal-sized blocks.
• Block is the unit of transfer of data between disk and main memory, e.g.
  • Read = copy block from disk to buffer in main memory.
  • Write = the opposite way.
  • R/w time = seek time + rotational delay + block transfer time.
• So, read/write to disk is a bottleneck:
  • Disk access ≈ 10^-3 sec.
  • Main memory access ≈ 10^-8 sec.
  • CPU instruction ≈ 10^-9 sec.

Files and records

• Data stored in files.
• File is a sequence of records.
• Record is a set of field values.
• For instance, file = relation, record = entity, and field = attribute.

Heap files

• Records are added to the end of the file.
• Then,
  • Cheap record addition.
  • Expensive record retrieval, removal and update, since they imply linear search.
  • Moreover, record removal implies waste of space. So, periodic reorganization.

Sorted files

• Records ordered according to some field.
• Then,
  • Cheap record retrieval by performing binary search (on the ordering field, otherwise expensive).
  • Expensive record addition, but less expensive record deletion (deletion markers + periodic reorganization).
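The retrieval cost in a heap file and in a sorted file can be compared with a small sketch. It counts inspected records instead of disk blocks, which is a simplification, and the 1000 key values are made up for the illustration.

```python
def linear_search(records, key):
    """Heap file: scan records in insertion order; return (found, records inspected)."""
    for inspected, value in enumerate(records, start=1):
        if value == key:
            return True, inspected
    return False, len(records)


def binary_search(sorted_records, key):
    """Sorted file: halve the search range each step; return (found, records inspected)."""
    low, high, inspected = 0, len(sorted_records) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        inspected += 1
        if sorted_records[mid] == key:
            return True, inspected
        if sorted_records[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return False, inspected


keys = list(range(0, 2000, 2))  # 1000 hypothetical values of the ordering field
heap_result = linear_search(keys, 1998)    # worst case: the last record
sorted_result = binary_search(keys, 1998)
```

For the last of the 1000 records, the linear search inspects all 1000 records, while the binary search needs at most about log2(1000) ≈ 10 inspections.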
Hash files

• The hash function (e.g. position = field mod r) returns a bucket number, where a bucket is one or several contiguous disk blocks.
• Collisions are typically resolved via overflow area.
• Cheapest random record retrieval (when searching for equality).
• Expensive ordered record retrieval.
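A toy version of a hash file, to make the bucket and overflow mechanics concrete. The class name HashFile, the fixed bucket capacity and the single shared overflow list are assumptions of this sketch. Real systems organise overflow in other ways as well.

```python
class HashFile:
    """Toy hash file: r buckets of fixed capacity plus one shared overflow area."""

    def __init__(self, r, bucket_capacity):
        self.r = r
        self.bucket_capacity = bucket_capacity
        self.buckets = [[] for _ in range(r)]
        self.overflow = []

    def bucket_number(self, key):
        return key % self.r  # position = field mod r

    def insert(self, key, record):
        bucket = self.buckets[self.bucket_number(key)]
        if len(bucket) < self.bucket_capacity:
            bucket.append((key, record))
        else:
            self.overflow.append((key, record))  # collision: the bucket is full

    def lookup(self, key):
        """Equality search: look in one bucket, then in the overflow area."""
        for k, record in self.buckets[self.bucket_number(key)]:
            if k == key:
                return record
        for k, record in self.overflow:
            if k == key:
                return record
        return None


hash_file = HashFile(r=5, bucket_capacity=2)
for key in (3, 8, 13, 21):
    hash_file.insert(key, f"record-{key}")
```

Keys 3, 8 and 13 all hash to bucket 3, so the third one ends up in the overflow area. An equality search touches a single bucket (plus overflow), whereas reading the records in key order would require visiting every bucket and sorting.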
Indexes

• File organization (heap, sorted, hash) determines the primary access method.
• However, this may not be fast enough.
• Solution: Indexes or secondary access method.
  • Primary index.
  • Secondary and clustering indexes.
  • Multilevel indexes.

Primary index

• Faster to access a random record via a binary search in the index than in the data file.
• Mind the index maintenance cost.

[Figure: index file and data file]
• Index file. Always: Ordered index file. Primary access method: Binary search!
• Data file. Assumption: Ordered data file. Primary access method: Binary search!
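A minimal sketch of a lookup through a primary index. It assumes the common layout in which the index holds one entry per data block, namely the key of the first record in that block. That layout, the block contents and the function name lookup are assumptions of this example.

```python
from bisect import bisect_right

# Ordered data file split into blocks; keys are sorted within and across blocks.
blocks = [[2, 5, 8], [12, 15, 19], [23, 27, 31], [40, 44, 48]]
# Index file: one entry per block, the key of the block's first record.
index = [block[0] for block in blocks]


def lookup(key):
    """Return (block number, found) using a binary search on the index."""
    position = bisect_right(index, key) - 1  # last block whose first key <= key
    if position < 0:
        return None, False  # key is smaller than every key in the file
    return position, key in blocks[position]
```

The index has one entry per block rather than one per record, so the binary search runs over far fewer entries than a binary search in the data file itself. The price is that the index has to be updated whenever records move between blocks.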
Transactions

• A transaction is a logical unit of database processing and consists of one or several operations.
• Simplified operations in a transaction:
  • Read-item(X)
  • Write-item(X)
• Example:

T1                                       T2
Read-item(my-account)
                                         Read-item(my-account)
my-account := my-account - 2000
                                         my-account := my-account + 1000
Write-item(my-account)
Read-item(other-account)
                                         Write-item(my-account)
other-account := other-account + 2000
Write-item(other-account)
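The interleaving of T1 and T2 on my-account can be replayed with a small simulation. It is a sketch: only my-account is modelled (other-account is left out), the starting balance of 5000 is made up, and the function name run is hypothetical.

```python
def run(schedule, balance):
    """Execute a schedule of (transaction, operation) steps on one shared item."""
    database = {"my-account": balance}
    local = {}  # each transaction's private copy of my-account
    for txn, op in schedule:
        if op == "read":
            local[txn] = database["my-account"]   # Read-item(my-account)
        elif op == "write":
            database["my-account"] = local[txn]   # Write-item(my-account)
        else:
            local[txn] += op                      # op is the amount to add
    return database["my-account"]


serial = [("T1", "read"), ("T1", -2000), ("T1", "write"),
          ("T2", "read"), ("T2", 1000), ("T2", "write")]
interleaved = [("T1", "read"), ("T2", "read"), ("T1", -2000), ("T2", 1000),
               ("T1", "write"), ("T2", "write")]

serial_balance = run(serial, 5000)
interleaved_balance = run(interleaved, 5000)
```

Run one after the other, the two transactions leave 4000 in the account. With the interleaved order, T2 overwrites the value written by T1 and the account ends at 6000, so the withdrawal of 2000 is lost.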
ACID properties for transactions

• Atomicity: A transaction is an atomic unit, i.e. it is either executed completely or not at all.
• Consistency: A database that is in a consistent state before the execution of a transaction is also in a consistent state after the execution of the transaction.
• Isolation: A transaction should act as if it is executed isolated from the other transactions.
• Durability: Changes in the database made by a committed transaction are permanent.

Who ensures that the ACID properties are satisfied?
• Atomicity: Recovery system.
• Consistency preservation: Programmer + DBMS.
• Isolation: Concurrency control.
• Durability: Recovery system.

Isolation: Concurrency control

• We are assuming multiple users but just one CPU.
• A transaction follows the two-phase locking (2PL) protocol if all Read-lock() and Write-lock() operations come before the first Unlock() operation in the transaction.
• Example: the two transactions T1 and T2 perform the same operations and differ only in when my-account1 is unlocked.

Follows 2PL:
  Read-lock(my-account1)
  Read-item(my-account1)
  Write-lock(my-account2)
  Unlock(my-account1)
  Read-item(my-account2)
  my-account2 := my-account2 + 2000
  Write-item(my-account2)
  Unlock(my-account2)

Does not follow 2PL:
  Read-lock(my-account1)
  Read-item(my-account1)
  Unlock(my-account1)
  Write-lock(my-account2)
  Read-item(my-account2)
  my-account2 := my-account2 + 2000
  Write-item(my-account2)
  Unlock(my-account2)
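The 2PL condition stated above is easy to check mechanically. The sketch below encodes a transaction as a list of (operation, item) pairs, which is an encoding chosen for this illustration, and leaves out the assignment steps since they do not affect locking.

```python
def follows_2pl(operations):
    """True if no Read-lock or Write-lock occurs after the first Unlock."""
    unlocked = False
    for name, _item in operations:
        if name == "Unlock":
            unlocked = True
        elif name in ("Read-lock", "Write-lock") and unlocked:
            return False  # a lock is requested in the shrinking phase
    return True


two_phase = [
    ("Read-lock", "my-account1"), ("Read-item", "my-account1"),
    ("Write-lock", "my-account2"), ("Unlock", "my-account1"),
    ("Read-item", "my-account2"), ("Write-item", "my-account2"),
    ("Unlock", "my-account2"),
]
not_two_phase = [
    ("Read-lock", "my-account1"), ("Read-item", "my-account1"),
    ("Unlock", "my-account1"), ("Write-lock", "my-account2"),
    ("Read-item", "my-account2"), ("Write-item", "my-account2"),
    ("Unlock", "my-account2"),
]
```

The first sequence acquires both locks before releasing anything and passes the check. The second releases my-account1 before locking my-account2 and fails it.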
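Atomicity and durability: Recovery system

Assumption: Deferred update.

System log:
  start-transaction T1
  write-item T1, D, 10, 20
  commit T1
  checkpoint
  start-transaction T4
  write-item T4, B, 10, 20
  write-item T4, A, 5, 10
  commit T4
  start-transaction T2
  write-item T2, B, 20, 15
  start-transaction T3
  write-item T3, A, 10, 30
  write-item T2, D, 20, 25
  CRASH

Recovery after the crash: NO-UNDO, REDO T4.
• T1 committed before the checkpoint, so nothing needs to be done for it.
• T4 committed after the checkpoint, so it is redone.
• T2 and T3 had not committed at the time of the crash. With deferred update they have not changed the database, so there is nothing to undo.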
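The NO-UNDO/REDO decision for this log can be sketched in a few lines. The tuple encoding of log entries and the function name recover are assumptions of this example. It also assumes that a checkpoint guarantees that the writes of all transactions committed before it are already on disk.

```python
log = [
    ("start", "T1"), ("write", "T1", "D", 10, 20), ("commit", "T1"),
    ("checkpoint",),
    ("start", "T4"), ("write", "T4", "B", 10, 20), ("write", "T4", "A", 5, 10),
    ("commit", "T4"),
    ("start", "T2"), ("write", "T2", "B", 20, 15),
    ("start", "T3"), ("write", "T3", "A", 10, 30),
    ("write", "T2", "D", 20, 25),
]  # the crash happens here


def recover(log):
    """Return (transactions to redo, writes to reapply) under deferred update."""
    last_checkpoint = max(
        (i for i, entry in enumerate(log) if entry[0] == "checkpoint"), default=-1)
    after_checkpoint = log[last_checkpoint + 1:]
    # Redo every transaction that committed after the last checkpoint.
    committed = [entry[1] for entry in after_checkpoint if entry[0] == "commit"]
    # Reapply all logged writes of those transactions as (item, new value).
    redo_writes = [(entry[2], entry[4]) for entry in log
                   if entry[0] == "write" and entry[1] in committed]
    # Uncommitted transactions are ignored: with deferred update they never
    # wrote to the database, so there is nothing to undo.
    return committed, redo_writes


to_redo, writes = recover(log)
```

For the log above only T4 is redone, which sets B to 20 and A to 10.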
Data mining

• Knowledge discovery from data.
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from data.
• Not everything is data mining, e.g. simple search and query processing.
[Figure: knowledge discovery process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
• Data pre-processing: Data integration, Normalization, Feature selection, Dimension reduction.
• Data mining: Pattern discovery, Association analysis, Classification, Clustering, Outlier analysis, …
• Post-processing: Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization.
Data mining

Association analysis:
• Find items that are frequently purchased together.
  • E.g., 50 % of my customers buy beer and diapers.
• Build accurate association rules from those sets of items.
  • E.g., Diaper → Beer [support = 0.5, confidence = 0.75].
  • Support = p (Diaper, Beer).
  • Confidence = p (Beer | Diaper).
• Association = correlation != causality.
• But we can pretend they are the same. Then,
  • Increase sales of beer by giving a discount for diapers.
  • Prevent drops in sales of beer by having diapers on stock.

Data mining

Classification (supervised learning):
• Construct models based on some training examples.
• Use the models to predict not-seen-before examples.
• Applications: Credit card fraud detection, medical diagnosis, …
• Knowledge = model.

Clustering (unsupervised learning):
• Group data into homogeneous groups.
• Applications: Cluster customers to tailor your products to the different groups, …
• Knowledge = clusters.
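Support and confidence for the rule Diaper → Beer can be computed directly from their definitions. The six baskets below are made up so that the numbers match the example (support = 0.5, confidence = 0.75). The function names are hypothetical.

```python
baskets = [  # hypothetical purchase transactions
    {"Diaper", "Beer"},
    {"Diaper", "Beer", "Milk"},
    {"Diaper", "Beer"},
    {"Diaper"},
    {"Milk"},
    {"Bread"},
]


def count(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def support(lhs, rhs):
    """p(lhs, rhs): fraction of all baskets containing both sides of the rule."""
    return count(lhs | rhs) / len(baskets)


def confidence(lhs, rhs):
    """p(rhs | lhs): fraction of the baskets with lhs that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


rule_support = support({"Diaper"}, {"Beer"})
rule_confidence = confidence({"Diaper"}, {"Beer"})
```

Three of the six baskets contain both items, which gives support 3/6 = 0.5. Four baskets contain diapers and three of those also contain beer, which gives confidence 3/4 = 0.75.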
Hands-on: MySQL + Weka

Weka:
• Collection of data mining techniques, such as data pre-processing, classification, regression, clustering, association rules, and visualization.
• Written in Java.
• Connected with MySQL via JDBC driver.
• Open source. Check http://www.cs.waikato.ac.nz/~ml/weka/
Summary

Focus on relational databases:
• ER diagram.
• Relational model.
• Physical model.
• Concurrency control and recovery systems.
• MySQL.
Non-relational databases: NoSQL.
Data mining:
• Overview.
• Association analysis.
• Weka.