PPT - Bo Yuan

Principles of Data Warehousing
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Outline
OLAP
Metadata
Data Warehouse
ETL
Data Marts
Multidimensional Data
2
A Manager’s Questions …
Who are our lowest or highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
Which customers are most likely to go to the competition?
What promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
3
Tourists, Farmers and Explorers
Tourists: Browse information harvested
by farmers.
Farmers: Harvest information
from known access paths.
Explorers: Seek out the unknown and previously
unsuspected rewards hiding in the detailed data.
4
History & Evolution
 60’s: Batch Reports
 Hard to find and analyze information
 Inflexible and expensive, reprogram every new request
 70’s: Terminal-Based DSS and EIS
 Still inflexible, not integrated with desktop tools
 80’s: Desktop Data Access and Analysis Tools
 Query tools, Spreadsheets, GUIs
 Easier to use, but only access operational databases
 90’s: Data Warehousing
 OLAP Engines and Tools
5
Data Everywhere
 I cannot find the data I need.
 Data are scattered over the network.
 Many versions
 I cannot get the data I need.
 May need experts to get the data.
 I cannot understand the data I found.
 Poorly documented
 Domain knowledge
 I cannot use the data I found.
 Quality
 Transformation
6
What is a data warehouse?
“A single, complete and consistent
store of data obtained from a
variety of different sources made
available to end users in a way
that they can understand and use
in a business context.”
7
What is data warehousing?
 Data warehousing: techniques for assembling and
managing data from various sources for the purpose of
answering business questions and making decisions.
 A data warehouse is a collection of data that is used
primarily in organizational decision making.
 A data warehouse is
 Subject-oriented
 Integrated
 Time-varying
 Non-volatile
[Figure: from Data to Information to Knowledge]
8
Data Warehouse Architecture
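[Figure: data from relational databases, ERP systems, purchased data and legacy data pass through extraction and cleansing and an optimized loader into the data warehouse engine, which is backed by a metadata repository and serves query and analysis tools.]
9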
Data Warehouse is …
 Subject-Oriented
 The data warehouse is organized around subjects of the enterprise
(e.g., customers, products, sales) rather than applications areas (e.g.,
customer invoicing, stock control, product sales).
 This is reflected in the need to store decision-support data instead of
application-oriented or operational data.
 Integrated
 The data warehouse integrates corporate application-oriented data
from different sources, which often include inconsistent data.
 The integrated data sources must be made consistent to present a
unified view of the data to the users.
10
Data Warehouse is …
 Time-Variant
 Data warehouses are time variant in the sense that they maintain both
historical and (nearly) current data.
 Historical information is of high importance to decision makers, who
often want to understand trends and relationships between data.
 Non-Volatile
 After the data are loaded into the data warehouse, there are no
changes, inserts, or deletes performed against the historical data.
 This is logical because the purpose of a warehouse is to enable you
to analyze what has occurred.
11
Operational Systems
 Operational Systems
 Run the business in real time.
 Based on up-to-the-second data.
 Optimized to handle large numbers of simple read/write transactions.
 Optimized for fast response to predefined transactions.
 Used by people who deal with customers, products.
 Database systems have been used traditionally for OLTP.
 Online Transaction Processing
 Clerical data processing tasks
 Detailed, up to date data
 Structured repetitive tasks
 Examples of Operational Data
 Customer Files
 Account Balance, Call Record
 Point of Sale Data, Production Record
12
Data Warehousing vs. OLTP
 Workload
 Data warehouses are designed to accommodate ad hoc queries. A data
warehouse should be optimized to perform well for a wide variety of possible
query operations.
 OLTP systems support only predefined operations and might be specifically
tuned or designed to support only these operations.
 Data Modifications
 A data warehouse is updated on a regular basis by the ETL process (run
nightly or weekly) using bulk data modification techniques. The users of a data
warehouse do not directly update the data warehouse.
 In OLTP systems, users routinely issue individual data modification statements
to the database. The OLTP database is always up to date, and reflects the
current state of each business transaction.
13
Data Warehousing vs. OLTP
 Schema Design
 Data warehouses often use denormalized or partially denormalized
schemas (such as a star schema) to optimize query performance.
 OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance, and to guarantee data consistency.
 Typical Operations
 A typical data warehouse query scans thousands or millions of rows.
For example, "Find the total sales for all customers last month."
 A typical OLTP operation accesses only a handful of records. For
example, "Retrieve the current order for this customer."
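The contrast between the two kinds of operation can be sketched with Python's built-in sqlite3 module. The orders table, its columns and its rows are hypothetical and serve only as an illustration; a real warehouse query would scan far more rows.

```python
import sqlite3

# Hypothetical in-memory table standing in for both workloads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2024-05", 120.0),
        (2, "bob", "2024-05", 80.0),
        (3, "alice", "2024-06", 50.0),
        (4, "carol", "2024-05", 200.0),
    ],
)

# Warehouse-style query: scan and aggregate every row of one month.
total_may = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE month = ?", ("2024-05",)
).fetchone()[0]

# OLTP-style operation: fetch a single record for one customer.
current = conn.execute(
    "SELECT order_id, amount FROM orders WHERE customer = ? "
    "ORDER BY order_id DESC LIMIT 1",
    ("alice",),
).fetchone()

print(total_may, current)
```

The first query touches every matching row and returns one aggregate; the second returns one row and would normally be served by an index.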
14
Data Warehousing vs. OLTP
 Historical Data
 Data warehouses usually store months or years of data to support
historical analysis.
 OLTP systems usually store data from only a few weeks or months to
meet the requirements of the current transaction.
 Number of Users
 Data Warehouses: hundreds of users.
 OLTP Systems: tens of thousands of users.
 Database Size
 Data Warehouses: 10GB - 1TB
 OLTP Systems: 100MB - 10GB
15
In summary …
Data warehousing helps optimize the business.
OLTP systems actually run the business.
16
Data Marts
 A data mart is a subset of an organizational data store, usually
oriented to a specific purpose or major data subject, that may
be distributed to support business needs.
 Departmental Data Warehouse
 A data warehouse tends to be a strategic but somewhat
unfinished concept; a data mart tends to be tactical and aimed
at meeting an immediate need.
 The smaller-scale data mart is typically easier to build than the
enterprise-wide warehouse; can be quickly implemented; and
offers tremendous, fast payback for the users.
 The downside comes when several department-focused data
marts are implemented with no forethought for a future data
warehouse that serves the entire enterprise.
17
Independent Data Marts
18
Dependent Data Marts
19
Data Granularity
 Granularity is the extent to which a system is broken down into small
parts, either the system itself or its description or observation.
 A key factor to consider in the design of data warehouses.
 The amount of data to be stored in the data warehouse.
 Operational Databases
 Transaction Oriented
 Detailed Records → Lowest Level of Granularity
 The details of the phone call made by Tom at 2:40pm yesterday
 Data Warehouses
 Decision Making
 Summarized Data → High Levels of Granularity
 The number of phone calls made by Tom last month
20
Data Granularity
21
Data Granularity
 High Levels of Granularity
 Reduce storage costs.
 Reduce CPU usage.
 Cannot answer certain queries.
• Did Tom call Mary last week?
 A tradeoff between the volume and the usage of data.
 Dual Levels of Granularity
 Store summarized data on disks.
• Cover 95% decision making queries.
• Data access is cheap and convenient.
 Store detailed data on tapes .
• Cover 5% decision making queries.
• Many records need to be involved to process a query.
• Data access is expensive and complicated.
 Many levels of granularity may be necessary in practice.
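The tradeoff can be sketched in a few lines of Python. The call records, names and dates below are hypothetical. The summarized table answers "how many calls did Tom make last month?" cheaply, but only the detailed records can answer "did Tom call Mary last week?".

```python
from collections import Counter

# Detailed data: one record per call (caller, callee, ISO date). Hypothetical.
calls = [
    ("Tom", "Mary", "2024-06-03"),
    ("Tom", "Ann", "2024-06-10"),
    ("Tom", "Mary", "2024-06-21"),
    ("Bob", "Tom", "2024-06-21"),
]

# Summarized data: number of calls per caller per month.
monthly_calls = Counter((caller, date[:7]) for caller, _, date in calls)

# Answerable from the summary alone.
tom_june = monthly_calls[("Tom", "2024-06")]


def called(caller, callee, start, end):
    """Needs the detailed records; the summary has lost the callee and day."""
    return any(
        c == caller and r == callee and start <= d <= end
        for c, r, d in calls
    )


tom_called_mary = called("Tom", "Mary", "2024-06-17", "2024-06-23")
print(tom_june, tom_called_mary)
```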
22
Data Partition
Original table:
Acct. No | Name | Balance | Date Opened | Interest Rate | Address

Frequently Accessed:
Acct. No | Balance

Rarely Accessed:
Acct. No | Name | Date Opened | Interest Rate | Address

Smaller Table & Less I/O
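A minimal sketch of this kind of vertical partitioning, using plain Python dictionaries as rows. The column names follow the slide; the account values and the `partition` helper are hypothetical.

```python
# Hypothetical account rows with the columns shown on the slide.
accounts = [
    {"acct_no": 1001, "name": "J. Doe", "balance": 250.0,
     "date_opened": "2019-04-01", "interest_rate": 0.02,
     "address": "10 Main St"},
    {"acct_no": 1002, "name": "A. Lee", "balance": 975.5,
     "date_opened": "2021-11-15", "interest_rate": 0.03,
     "address": "80 Willow Rd"},
]

HOT_COLUMNS = ("acct_no", "balance")  # frequently accessed


def partition(rows, hot_columns, key="acct_no"):
    """Split each row into a narrow hot part and a wide cold part.

    The key column is kept in both parts so they can be joined again.
    """
    hot = [{c: r[c] for c in hot_columns} for r in rows]
    cold = [
        {c: v for c, v in r.items() if c == key or c not in hot_columns}
        for r in rows
    ]
    return hot, cold


hot, cold = partition(accounts, HOT_COLUMNS)
print(hot[0])   # narrow row: fewer bytes to read per balance lookup
print(cold[0])
```

Balance lookups read only the narrow table, which is where the saving in I/O comes from.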
23
Data Quality
 Data warehouses are based on existing data sources.
 Data quality matters!
 Creating a data warehouse is not a straightforward process.
 Warehouse data are from disparate and questionable sources.
 Legacy systems are no longer documented.
 Corporate wide standards are not well implemented.
 Advanced techniques and tools are needed to do the job.
24
25
Extract, Transform & Load
 Extract, Transform & Load (ETL)
 The interface between external sources and data warehouses
 ETL may take around 70% of the total workload.
 Can be implemented manually in any programming language.
 Commercial ETL tools are widely available.
 Extract
 To consolidate data from different source systems.
• Flat Files
• Relational Databases
• Customized Applications
• Point of Sale Devices
• Web Pages
 To locate the sources for each data item in the data warehouse.
• Not all data are to be extracted.
26
Extract, Transform & Load
 Transform
 To apply a series of rules or functions to the extracted data to derive
the data for loading into the end target.
 Typical Functions
• Formatting
• Encoding
• Aggregating
• Splitting
• Deriving
• Converting
• Integrating
 Load
 To load the extracted, cleaned and validated data into the end target.
• Online vs. Offline Loads
• Incremental vs. Full Loads
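The three stages can be sketched as three small Python functions. This is only an illustration under stated assumptions: the flat file content, the sales table and the function names are hypothetical, and the incremental load is approximated with SQLite's `INSERT OR IGNORE`, which skips rows whose key is already present.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source with inconsistent formatting.
FLAT_FILE = "order_id,customer,amount_cents\n1, Alice ,1250\n2,BOB,800\n"


def extract(text):
    """Extract: read raw records from a flat file."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: format names, convert cents to currency units."""
    return [
        (int(r["order_id"]),
         r["customer"].strip().title(),
         int(r["amount_cents"]) / 100)
        for r in rows
    ]


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


def load(conn, rows):
    """Load incrementally: rows with an existing key are skipped."""
    before = row_count(conn)
    conn.executemany("INSERT OR IGNORE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return row_count(conn) - before


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "amount REAL)"
)
first = load(conn, transform(extract(FLAT_FILE)))
second = load(conn, transform(extract(FLAT_FILE)))  # nothing new to load
print(first, second)
```

A full load would instead empty the target table and insert every row again.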
27
ETL --- Challenges
[Figure: source systems Savings, Loans, Trust and Credit Card]
 Same data, different name
 Different data, same name
 Inconsistent name or data
28
ETL --- Challenges
[Figure: applications in the external sources feed the data warehouse, each with its own conventions.]
Encoding:
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female
Unit of measure:
appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds
Field name:
appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr
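One way to reconcile the four applications is a per-source lookup table, sketched below in Python. Which code means "male" in applications B and C is an assumption (the slide only lists the code pairs), as are the `unify` function and the choice of centimetres as the target unit.

```python
# Per-source conventions taken from the figure. The male/female assignment
# for sources B and C is an assumption made for this sketch.
GENDER_CODES = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def unify(source, record):
    """Map one source record onto the warehouse's single convention."""
    return {
        "gender": GENDER_CODES[source][str(record["gender"])],
        "pipeline_cm": round(record["pipeline"] * TO_CM[source], 2),
        "balance": record[BALANCE_FIELD[source]],
    }


print(unify("B", {"gender": 1, "pipeline": 10, "bal": 99.5}))
print(unify("C", {"gender": "y", "pipeline": 2, "currbal": 5}))
```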
29
ETL --- Challenges
 Same person, different spellings
 吕: LV, LUI, LYU
 Multiple ways to denote company name
 Global Systems, GSPL, Global Pty. LTD.
 Use of different names for the same object/concept
 Holland vs. Netherlands
 Inconsistent data values
 Age, Marital Status …
 Required fields left blank
 Missing Values
 Invalid product codes collected at point of sale
 Manual entry leads to mistakes.
 Different conventions: using “-1” or “99999” to indicate an error
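A minimal cleaning sketch for two of these problems: sentinel values and synonymous names. Treating "-1", "99999" and blank fields as missing is an assumption of this sketch, and the synonym table is hypothetical and deliberately tiny.

```python
# Values that some source systems use to flag an error or a blank field.
SENTINELS = {"-1", "99999", ""}

# Different names for the same concept, mapped to one canonical name.
COUNTRY_SYNONYMS = {"holland": "Netherlands", "netherlands": "Netherlands"}


def clean_value(raw):
    """Return None for sentinel or blank values, else the trimmed text."""
    text = str(raw).strip()
    return None if text in SENTINELS else text


def clean_country(raw):
    """Normalize a country name after removing sentinel values."""
    value = clean_value(raw)
    if value is None:
        return None
    return COUNTRY_SYNONYMS.get(value.lower(), value)


print(clean_value(99999), clean_value(" 42 "), clean_country("Holland"))
```

Records that come back as `None` still have to be handled downstream, for example by rejecting them or by filling in a default.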
30
Metadata
 Metadata is information about data.
 Metadata facilitates the understanding, characterization, and
management of data.
 Metadata can document data about data attributes & structure.
 Metadata may include descriptive information about the context, quality
and condition, or characteristics of the data.
 Metadata for a Book
 Title, Author, Subject, ISBN, Number of Pages …
 Metadata for a data warehouse
 The data defining warehouse objects
 A roadmap telling users what is in the warehouse and how to find it
 Far more sophisticated than a data dictionary
31
Metadata Repository
 Data definition and mapping metadata
 The meaning of each attribute and where the data come from
 Data structure metadata
 The structure of the tables (the data type of each column, primary/foreign key)
 Source system metadata
 The data structure of all the source systems feeding in the warehouse
 ETL process metadata
 The description of each data flow (source, target, transformation, schedule)
 Data quality metadata
 Data quality rules, where they apply, their risk levels and the actions to take
 Audit metadata
 The results of all processes (ETL, security log, indexing) in the warehouse
 Usage metadata
 Records about which reports and cubes are used, by whom and when
32
Data Models in Data Warehouses
 In OLTP systems, data are stored in two-dimensional tables.
 Data warehouses are subject-oriented
 Profits, Sales …
 Data need to be reorganized to better reflect the subjects.
 A data warehouse is based on a multidimensional data model, which views
data in the form of a data cube.
 A data cube allows data to be modeled and viewed in multiple dimensions.
 Fact tables contain measures of interest (such as dollars sold) and keys to
each of the related dimension tables.
 Dimension tables provide the context of the measures such as item (item
name, brand), product, location or time(day, week, month, quarter, year).
33
From Tables to Data Cubes
ID  Product  Country  Date  Sales
1   TV       US       1Qtr  100
2   PC       Canada   4Qtr  500
3   CAR      US       2Qtr  30
4   PC       UK       3Qtr  200
5   CAR      UK       1Qtr  20
6   CAR      UK       2Qtr  15
7   TV       Canada   4Qtr  80
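The step from this table to a cube can be sketched in Python. Each row contributes to every cell that matches it, where each dimension is either the row's own value or the aggregate "sum". The dictionary layout keyed by (product, country, date) is an assumption of this sketch, not the storage format of any particular OLAP engine.

```python
from collections import defaultdict
from itertools import product as cartesian

# The seven rows of the table: (ID, Product, Country, Date, Sales).
rows = [
    (1, "TV", "US", "1Qtr", 100),
    (2, "PC", "Canada", "4Qtr", 500),
    (3, "CAR", "US", "2Qtr", 30),
    (4, "PC", "UK", "3Qtr", 200),
    (5, "CAR", "UK", "1Qtr", 20),
    (6, "CAR", "UK", "2Qtr", 15),
    (7, "TV", "Canada", "4Qtr", 80),
]

# cube[(product, country, date)] -> total sales, with "sum" as a wildcard.
cube = defaultdict(int)
for _id, prod, country, date, sales in rows:
    # 2^3 = 8 cells per row: every mix of concrete value and "sum".
    for cell in cartesian((prod, "sum"), (country, "sum"), (date, "sum")):
        cube[cell] += sales

print(cube[("TV", "US", "sum")])     # total annual sales of TV in the US
print(cube[("sum", "sum", "sum")])   # grand total
```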
34
From Tables to Data Cubes
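[Figure: 3-D data cube with dimensions Product (TV, PC, CAR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A, Canada, U.K., sum). One highlighted cell holds the total annual sales of TV in U.S.A.]
35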
Cube: A Lattice of Cuboids
0-D cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D cuboid: time, item, location, supplier
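The lattice is simply the set of all subsets of the dimensions, so it can be enumerated with `itertools.combinations`. With n dimensions (and no concept hierarchies) there are 2^n cuboids; the sketch below reproduces the 16 cuboids for the four dimensions on the slide.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# lattice[k] lists every k-D cuboid as a tuple of dimension names.
# The empty tuple at k = 0 is the "all" cuboid.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])
print("cuboids in total:", total)
```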
36
Data Warehouse Schemas
 Star Schema
 A fact table in the middle connected to a set of dimension tables
 Snowflake Schema
 A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape
similar to snowflake
 Fact Constellations
 Multiple fact tables sharing dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact constellation
37
The Star Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
38
The Star Schema: An Example
product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

sale (fact table)
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
39
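The example can be loaded into SQLite to show a typical star join: the fact table `sale` is joined to the `store` and `product` dimensions and the measure `amt` is summed per group. The column types and the choice of query are assumptions of this sketch; the rows are the ones on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER,
                       prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Star join: fact table in the middle, one join per dimension used.
sales_by_city_and_product = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON store.storeId = sale.storeId
    JOIN product ON product.prodId = sale.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()

print(sales_by_city_and_product)
```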
The Snowflake Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, province_or_state, country
40
The Galaxy Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  Keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type
41
Concept Hierarchy
Location
all:     all
region:  Europe ... North_America
country: Germany ... Spain; Canada ...
city:    Frankfurt ...; Vancouver ... Toronto
office:  L. Chan ... M. Wind
42
Set-Grouping Hierarchy
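[Figure: set-grouping hierarchy on price. The range [$0 - $1000] is grouped into inexpensive, moderate and expensive; the only subrange that survives in the text is [$0 - $150].]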
43
View of Hierarchies
44
Bitmap Index
 Index on a particular column.
 Each value in the column corresponds to a bit vector.
 The length of the bit vector: # of records in the base table.
 Not suitable for high cardinality domains
Base Table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
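The two indexes can be rebuilt from the base table with a few lines of Python. Here each bit vector is stored as an integer whose bit i corresponds to record i+1; that representation and the helper names are assumptions of this sketch. A conjunctive query then reduces to a bitwise AND.

```python
# Base table from the slide: (Cust, Region, Type).
base = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(values):
    """One bit vector per distinct value; bit i is set if record i has it."""
    index = {}
    for pos, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << pos)
    return index


def as_bits(bitmap, n):
    """Render a bitmap as a string, first record on the left."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(n))


region_idx = build_bitmap_index([row[1] for row in base])
type_idx = build_bitmap_index([row[2] for row in base])

# Region = Asia AND Type = Dealer is a single bitwise AND.
hits = region_idx["Asia"] & type_idx["Dealer"]
matching = [base[i][0] for i in range(len(base)) if (hits >> i) & 1]

print(as_bits(region_idx["Asia"], len(base)))
print(as_bits(type_idx["Dealer"], len(base)))
print(matching)
```

A column with many distinct values would need one vector per value, which is why the technique suits low-cardinality domains.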
45
OLAP
 Online Analytical Processing
 Fast Analysis of Shared Multidimensional Information (FASMI)
 Slice and Dice:
 Project and Select
 Roll up (drill-up): summarize data
 By climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 From higher level summary to lower level summary or detailed data,
or introducing new dimensions
 Pivot (rotate):
 Reorient the cube
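These operations can be sketched on a tiny cube held in a Python dictionary. The cell values, the dimension names and the function names are hypothetical. Roll-up is shown only in its dimension-reduction form; climbing a concept hierarchy would additionally map each value to its parent before summing.

```python
from collections import defaultdict

DIMS = ("product", "country", "quarter")

# Hypothetical cube: (product, country, quarter) -> sales.
cube = {
    ("TV", "US", "Q1"): 100,
    ("TV", "US", "Q2"): 120,
    ("PC", "US", "Q1"): 300,
    ("PC", "UK", "Q1"): 200,
    ("TV", "UK", "Q2"): 50,
}


def slice_(cube, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Dice: select a set of values on two or more dimensions."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def roll_up(cube, drop):
    """Roll up by dimension reduction: sum over the dropped dimension."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for k, v in cube.items():
        out[k[:i] + k[i + 1:]] += v
    return dict(out)


def pivot(table):
    """Pivot a 2-D table: swap its row and column axes."""
    return {(col, row): v for (row, col), v in table.items()}


q1 = slice_(cube, "quarter", "Q1")
tv = dice(cube, product={"TV"}, country={"US", "UK"})
by_product_country = roll_up(cube, "quarter")
by_country_product = pivot(by_product_country)
print(len(q1), len(tv), by_country_product[("US", "TV")])
```

Drill-down is the reverse of `roll_up` and therefore needs the more detailed cells to be available; it cannot be computed from the summary alone.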
46
Browsing a Data Cube
47
Slicing and dicing
[Figure: data cube with dimensions Product (Household, Telecomm, Video, Audio), region (Europe, Far East, India) and Sales Channel (Retail, Direct, Special); the Telecomm slice is highlighted.]
48
Roll-Up & Drill-Down
Higher Level of
Aggregation
 Sales Channel
 Region
 Country
 State
 Location Address
 Sales Representative
Low-level
Details
49
50
Pivot
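[Figure: pivoting a table with Product (Juice, Cola, Milk, Cream) on one axis and the dates 3/1, 3/2, 3/3, 3/4 on the other; sample cell values 10, 47, 30, 12.]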
51
OLAP Server Architectures
 Relational OLAP (ROLAP)
 Use relational DBMS to store and manage warehouse data.
 ROLAP tools access the data in a relational database and generate SQL
queries to calculate information at the appropriate level as required.
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Fast query performance due to optimized storage and indexing
 Automated computation of higher level aggregates of the data
 Very compact for low dimension data sets.
 Array model provides natural indexing
 Hybrid OLAP (HOLAP)
 User flexibility
 Low level: relational
 High-level: array
52
Warehouse Products
 Computer Associates -- CA-Ingres
 Hewlett-Packard -- Allbase/SQL
 Informix -- Informix, Informix XPS
 Microsoft -- SQL Server
 Oracle -- Oracle 7, Oracle Parallel Server
 Red Brick -- Red Brick Warehouse
 SAS Institute -- SAS
 Software AG -- ADABAS
 Sybase -- SQL Server, IQ, MPP
53
Data Warehouse Vendors
54
Data Warehouse Vendors
55
Review
 What is a data warehouse?
 What is data warehousing?
 What is the difference between OLTP and data warehousing?
 What does ETL stand for?
 What is the meaning of Metadata?
 What is the star schema?
 What is the snowflake schema?
 What is an OLAP cube?
 What are the most common OLAP operations?
56
Next Week’s Class Talk
 Volunteers are required for next week’s class talk.
 Topic: Business Intelligence
 Length: 20 minutes plus question time
 Suggested Points of Interest
 Aim & Scope
• Techniques involved
 Market
• Vendors & Products
 Typical applications
• Supermarkets, Airlines, Financial Institutions …
 Prospect of employment
• Major BI companies
 The future of BI
• Development trends
57
Project Option--- Data Warehousing
 Aim
 To gain hands-on experience with data warehousing.
 To get familiar with popular data warehousing software.
 To build up teamwork and interpersonal skills.
 Deliverables
 Reports
 Oral Presentation or Poster
 Due
 Reports must be submitted before Week 14.
 Oral presentations and posters are scheduled for Week 15.
 Software
 PowerOLAP
 InstantOLAP
 Pentaho
58