Data Warehouses - American University

Data Warehouses
BUAD/American University
Data Warehouses
Definition
• Data Warehouse: An integrated and
consistent store of subject-oriented data that
is obtained from a variety of sources and
formatted into a meaningful context to
support decision-making in an organization.
Need for Data Warehousing
• Integrated, company-wide view of high-quality information.
• Separation of operational and informational
systems and data
– operational system: a system that is used to run
a business in real time, based on current data
– informational system: a system designed to
support decision making based on stable point-in-time or historical data
Factors Allowing Data Warehousing
• Relational DBMS.
• Advances in hardware: speed and storage
capacity.
• End-user computing interfaces and tools.
Data Warehouse Architectures
• Two-level
– source system files containing operational data
– transformed and integrated data warehouse
• Three-level
– Operational data.
– Enterprise data warehouse (EDW) - single source of
data for decision making.
– Data marts - limited scope; data selected from
EDW; customized decision-support for individual
user groups
Generic data warehouse architecture
Three-layer architecture
Reasons for the Three-Level Architecture
• EDW and data marts have different
purposes and data architectures.
• Data transformation is complex and is best
performed in two steps.
• Data marts provide customized decision support for
different user groups.
• Three data layers
– Operational data, reconciled data, derived data.
Three-layer data architecture
Data Characteristics
• Status vs. Event data.
– A transaction is a business activity that triggers one
or more business events; event data capture those events.
– Status data describe the state of a record before and after an event.
• Transient vs. Periodic data.
– Transient: data in which changes to existing
records are written over previous records, thus
destroying previous data content
– periodic data: data that are never physically altered
or deleted once added
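The difference between transient and periodic data can be sketched in a few lines of Python. This is a minimal illustration, not a warehouse design: the customer key, the balances, the dates, and the helper names `record` and `balance_as_of` are all hypothetical.

```python
from datetime import date

# Transient store: one row per key; an update overwrites the old value.
transient = {"cust-1": {"balance": 100}}
transient["cust-1"] = {"balance": 75}  # the previous balance (100) is lost

# Periodic store: append-only; each change adds a dated row.
periodic = []

def record(key, balance, as_of):
    """Append a new version instead of altering an existing row."""
    periodic.append({"key": key, "balance": balance, "as_of": as_of})

record("cust-1", 100, date(2024, 1, 1))
record("cust-1", 75, date(2024, 2, 1))

def balance_as_of(key, when):
    """Return the most recent balance recorded on or before `when`."""
    rows = [r for r in periodic if r["key"] == key and r["as_of"] <= when]
    return max(rows, key=lambda r: r["as_of"])["balance"] if rows else None
```

Because the periodic store keeps every version, a question such as "what was the balance in mid-January?" can still be answered; the transient store can only report the current value.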
Example of DBMS log entry
Transient operational data
Reconciled Data Characteristics
• Detailed
• Historical
• Normalized
• Enterprise-wide
• Quality controlled
The Data Reconciliation Process
• Capture: extract the relevant data from
source files to fill the EDW
– Static - initial load.
– Incremental - ongoing update.
• Scrub or data cleansing
– missing data, name reconciliation
– Pattern recognition and other artificial
intelligence techniques.
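The capture and scrub steps might look roughly like the sketch below. It assumes each source row carries an `updated` counter and that name variants are reconciled through a small lookup table; the field names, the `CANONICAL_NAMES` table, and the sample rows are invented for illustration. Real cleansing tools use far richer matching than this.

```python
def capture_static(source_rows):
    """Static capture: take every source row (initial load)."""
    return list(source_rows)

def capture_incremental(source_rows, last_captured):
    """Incremental capture: take only rows changed since the last run."""
    return [r for r in source_rows if r["updated"] > last_captured]

# Hypothetical lookup that maps name variants to one canonical name.
CANONICAL_NAMES = {"ibm corp.": "IBM", "i.b.m.": "IBM"}

def scrub(row):
    """Fill a missing customer name and reconcile known variants."""
    cleaned = dict(row)
    name = (cleaned.get("customer") or "").strip()
    cleaned["customer"] = CANONICAL_NAMES.get(name.lower(), name) or "UNKNOWN"
    return cleaned

source = [
    {"customer": "I.B.M.", "updated": 1},
    {"customer": None, "updated": 2},
    {"customer": "IBM Corp.", "updated": 3},
]
initial = [scrub(r) for r in capture_static(source)]
delta = [scrub(r) for r in capture_incremental(source, last_captured=2)]
```

Here `initial` holds all three rows with reconciled names, while `delta` holds only the row changed after the last capture.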
Steps in data reconciliation
The Data Reconciliation Process
• Transform
– Convert the data format from the source to the target
system.
– Record-Level Functions
• Selection.
• Joining.
• Aggregation (for data marts).
– Field-Level Functions
• Single-field transformation
• Multi-field transformation
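The record-level and field-level functions listed above can be illustrated on a few in-memory rows. The sample data, field names, and the 1,000-cent selection threshold are assumptions made only for this sketch.

```python
from collections import defaultdict

sales = [
    {"store_id": 1, "amount_cents": 1250, "first": "Ann", "last": "Lee"},
    {"store_id": 2, "amount_cents": 300, "first": "Bo", "last": "Ray"},
    {"store_id": 1, "amount_cents": 4500, "first": "Cy", "last": "Fox"},
]
stores = {1: "East", 2: "West"}  # store_id -> region

# Selection: keep only rows that meet a criterion.
selected = [s for s in sales if s["amount_cents"] >= 1000]

# Joining: attach the region from a second source.
joined = [{**s, "region": stores[s["store_id"]]} for s in selected]

# Field-level transformations.
for row in joined:
    # Single-field: convert cents to dollars.
    row["amount"] = row["amount_cents"] / 100
    # Multi-field: combine two source fields into one target field.
    row["customer"] = f'{row["first"]} {row["last"]}'

# Aggregation (for a data mart): total amount per region.
totals = defaultdict(float)
for row in joined:
    totals[row["region"]] += row["amount"]
```

Selection and joining work on whole records, while the two transformations inside the loop touch individual fields; aggregation is the step that would typically feed a data mart rather than the EDW.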
The Data Reconciliation Process
• Load and Index
– Refresh Mode
• When the warehouse is first created.
• Static data capture.
– Update Mode
• Ongoing update of the warehouse.
• Incremental data capture.
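A rough sketch of the two load modes and a simple index follows. It assumes the warehouse is a plain list of rows and that update mode appends changed rows rather than overwriting them, in keeping with periodic data; the function names and sample rows are hypothetical.

```python
def refresh(snapshot):
    """Refresh mode: build the target from a full snapshot of the source."""
    return [dict(row) for row in snapshot]

def update(warehouse, changes):
    """Update mode: append only new or changed rows; existing rows are kept."""
    warehouse.extend(dict(row) for row in changes)
    return warehouse

def build_index(warehouse, key):
    """Index: map each key value to the positions of its rows."""
    index = {}
    for pos, row in enumerate(warehouse):
        index.setdefault(row[key], []).append(pos)
    return index

warehouse = refresh([{"id": 1, "qty": 5}, {"id": 2, "qty": 7}])
update(warehouse, [{"id": 1, "qty": 6}])
index = build_index(warehouse, "id")
```

After the update, the warehouse holds both versions of row 1, and the index points to each of them.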
Derived Data Characteristics
• Type of data
– Detailed, possibly periodic.
– Aggregated.
• Distributed to departmental servers.
• Implemented in star schema.
Star Schema
• Also called the dimensional model.
• Fact and dimension tables.
– Fact table: consists of factual or quantitative
data about the business
– Dimension table: holds descriptive data
• Grain of a fact table - the level of detail of each
record, such as the time period it covers.
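As a small illustration of fact and dimension tables, the sketch below builds a star schema in an in-memory SQLite database using Python's standard `sqlite3` module. The table names, columns, and sample rows are assumptions chosen for the example, with a grain of one row per store, product, and monthly period.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: descriptive data.
CREATE TABLE store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE period  (period_id INTEGER PRIMARY KEY, month TEXT);
-- Fact table: foreign keys to each dimension plus a numeric measure.
CREATE TABLE sales (
    store_id   INTEGER REFERENCES store(store_id),
    product_id INTEGER REFERENCES product(product_id),
    period_id  INTEGER REFERENCES period(period_id),
    units      INTEGER,
    PRIMARY KEY (store_id, product_id, period_id)
);
INSERT INTO store   VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO product VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO period  VALUES (100, '2024-01'), (101, '2024-02');
INSERT INTO sales   VALUES (1, 10, 100, 5), (1, 20, 100, 3),
                           (2, 10, 100, 4), (1, 10, 101, 6);
""")

# Join the fact table to one dimension and total the measure by city.
rows = con.execute("""
    SELECT s.city, SUM(f.units)
    FROM sales AS f JOIN store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
```

The fact table sits at the center and each dimension joins to it through a single key, which is what gives the schema its star shape.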
Components of a star schema
Star schema example
Star schema with sample data
Example of snowflake schema
Size of the fact table
• Total number of stores: 1,000
• Total number of products: 10,000
• Total number of periods: 24
• Maximum rows: 1,000 * 10,000 * 24 = 240,000,000
• If on average 50% of items record sales in a given store and period,
– number of rows = 120,000,000
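The same estimate can be reproduced in a few lines of Python. The variable names are illustrative; the figures are the ones given above.

```python
import math

dimension_sizes = {"stores": 1_000, "products": 10_000, "periods": 24}
share_with_sales = 0.5  # fraction of store/product/period combinations with a sale

# Upper bound: one row for every combination of dimension values.
max_rows = math.prod(dimension_sizes.values())

# Expected size when only some combinations actually record sales.
expected_rows = round(max_rows * share_with_sales)
```

Adding a dimension or choosing a finer grain (for example, days instead of months) multiplies `max_rows`, which is why the grain decision dominates the size of the fact table.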
Types of Data Marts
• Dependent - Populated from the EDW.
• Independent - Data taken directly from the
operational databases.
The User Interface
• The role of metadata.
• Traditional query and reporting tools.
• On-line analytical processing (OLAP)
– The use of a set of graphical tools that provides
users with multidimensional views of their data
and allows them to analyze the data using
simple windowing techniques.
The User Interface
– Slicing a cube.
– Pivot
• Rotate the view for a particular data point to obtain
another perspective.
• E.g. take a value from the units column and obtain
by-store values.
– Drill-down
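Slicing, pivoting, and drill-down can be sketched on a tiny cube held in a Python dictionary. The three dimensions, the sample values, and the helper names `slice_cube` and `totals_by` are assumptions for illustration; OLAP tools perform these operations interactively on much larger cubes.

```python
from collections import defaultdict

# Each cell of the cube: (product, store, month) -> units sold.
cube = {
    ("Widget", "Boston", "2024-01"): 5,
    ("Widget", "Denver", "2024-01"): 4,
    ("Gadget", "Boston", "2024-01"): 3,
    ("Widget", "Boston", "2024-02"): 6,
}
DIMS = ("product", "store", "month")

def slice_cube(cube, dim, value):
    """Slice: fix one dimension at a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def totals_by(cube, dim):
    """Summarize the measure along one dimension."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for k, v in cube.items():
        out[k[i]] += v
    return dict(out)

widget = slice_cube(cube, "product", "Widget")  # slice on one product
by_month = totals_by(widget, "month")           # one view of the slice
by_store = totals_by(widget, "store")           # pivot: same units, by store
# Drill-down: break the Boston total into its monthly detail.
boston_detail = totals_by(slice_cube(widget, "store", "Boston"), "month")
```

The pivot changes only the dimension used for grouping, while the drill-down moves from the Boston total in `by_store` to the finer monthly values behind it.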
Slicing a data cube