Data Warehousing Notes

advertisement
Data Warehouse Notes
OLTP vs OLAP

OLTP: Online Transaction Processing

Most common type of database for data input

ER design approach

Database design looks like “real world”

Throughput and response time are big concerns
Exercise: Examples of transactional databases
Physical Data Storage and Usage in OLTP System:

Data stored on disk in data pages (4k, 8k, 16k, 32k in size)

Data IO is also in data pages

Memory image of database table is also data pages

Need one row of one table, read an entire data page.

Rows per Page ~ Page Size/Bytes per Row
Physical storage of rows of a single table
...
A
B
...
C
...
...
Exercise: Cost of Retrieval (A,B) vs (C,D)
D
...
...
Data Input Strategies

All transactions get added to the end of the last data page in the
chain of data pages that make up this table.


single point of contention slows everyone down
most reporting is looking at older data on different data
pages so not blocked by data input on newly created data
pages
report accesses these pages
...
...
...
...
...
...
new rows
added here
Data Input Strategies

Transactions with similar key values are placed on the same
data pages (eg, orders from the same state)


multiple insertion points so less chance of insertions running
into each other
reporting now runs into problems reporting from the same
pages where data is being entered
MA
RI
CT
data insertion and reporting
use the same pages
...
...
...
...
insertion
...
...
...
Solution 1

Have OLTP and OLAP run on different copies of the same
database – live and day-old.



Each night copy all new transactions to day-old database
Or mirror data insertion in a duplicate database
Problems:

Insertion and Reporting are “orthogonal” so the structure that
makes one task easy makes the other difficult.

Insertion typically restricted to one external view at a time

Reporting is often across many external views

Insertion easy to design and designed only once

Reports hard to design and new reports all the time
Solution 2:


OLAP database has a different structure than the OLTP
database.
Most common structure is called the Star Schema.
Dimension
Tables
Fact
Table
Address =
{address, country, state,
region}
Alternative to Star Schema:


OLAP database has a different structure than the OLTP
database.
Another common structure is called the Snowflake Schema.
Dimension
Tables
Fact
Table
Address =
{address, county_id}
Country = {id, state_id}
State = {id, county_id}
County = {id, ...}
Fact Table:
Fact table schema
consists of numeric facts
called measures
categorized by
dimensions
Dimensions are qualitative
properties of the data
and measures are
quantitative properties.
ID
D1_ID
D2_ID
D3_ID
...
m1
m2
m3
...
D1
ID
...
D2
ID
...
D3
ID
...
Example Fact Table
HMO_CLAIM_HISTORY
claim_type
patient_state_code
claim_user_id
patient_claim_id
prmy_procedure_id
prmy_diagnosis_id
claim_from_date
..............................
claim_pymt_amt
ddctbl_amt
coinsurance_amt
(cp_amt, dd_amt,coin_amt)
state
claim_type
data cube
Improvement?

All transactions fit into a single table

Every report is a slight modification of every other report



select new dimensions
select new measures
Problems:



Fact table can get very large (terabytes)
New business decisions lead to new dimensions so table
has to be restructured
Smaller cubes are better for reporting since more efficient.
Fact Table and daily, materialized report tables:
Add additional summary
tables that are updated
as the main fact table is
updated
Report #1
ID
D1_ID
D2_ID
D3_ID
...
m1
m2
m3
...
D1
ID
...
D2
ID
...
D3
ID
...
D1.Name
D2.Name
...
AggD3ImpactOn_m1
...
select d1.Name,d2.Name, sum(m1) as AggD3ImpactOn_m1
from FactTable f, D1 d1, D2 d2
where f.D1_ID = d1.ID and f.D2_ID = d2.ID
group by D1_ID,D2_ID
Exercise:



Suppose that the Library database is used in several branch
libraries of a large system.
Suppose that when a book is returned, a permanent record of
the loan is put into a central fact table in a central data
warehouse.
What kind of data might be put in the new fact table? What are
the dimensions? Any measures?
Library OLAP Schema:
Cardholder
Book
Copy
LibraryBranch
Library_ID
Borrower_ID
Accession_No
isbn
loan_date
return_date
return_code
days_on_loan
ETL:


Extraction-Transformation-Load
Process by which data is copied from OLTP source databases
to a single OLAP report database/data warehouse
Download