Populating Data Warehouse Structures

Data Warehouse
Why Data Warehouse and OLAP?
• The intelligence stage of decision making needs correct data
• The data should also be clean, organized, and fast to access
• Often the data should only be read in subsequent stages and should not be changed (in other words, decision makers are users of data, not generators of data)
• The data for decision making comes from multiple sources
  • OLTP databases, XML files, flat files such as CSV, PDF, etc.
• These sources need to be combined into one version of the truth
• These needs cannot easily be served by traditional data sources, hence the concept of data warehouses and data marts
Business Intelligence Process
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Warehouse: Definitions
• Data warehouse
  • There is no single, rigorous definition, as the term can encompass many aspects of data management. Some of the definitions found in the literature are:
    • A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
    • A decision support database that is maintained separately from the organization’s operational database
    • A consistent database source that brings together information from multiple sources for decision support queries
    • A store that supports information processing by providing a solid platform of consolidated, historical data for analysis
• Data warehousing
  • A process by which data from multiple sources are extracted, transformed, and loaded into a data warehouse in a planned operation
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  • Major task of a traditional relational DBMS
  • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
  • Aims at reliable and efficient processing of a large number of transactions while ensuring data consistency
• OLAP (on-line analytical processing)
  • Major task of a data warehouse system
  • Data analysis and decision making
  • Aims at efficient multidimensional processing of large data volumes
  • Fast, interactive answers to large aggregate queries
• Distinct features (OLTP vs. OLAP):
  • User and system orientation: customer vs. market
  • Data contents: current, detailed vs. historical, consolidated
  • Database design: ER + application vs. star + subject
  • View: current, local vs. evolutionary, integrated
  • Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
Feature             OLTP                         OLAP
User                Clerk, IT professional       Knowledge worker
Function            Day-to-day operations        Decision support
DB design           Application-oriented         Subject-oriented
Data                Current, isolated            Historical, consolidated
View                Detailed, flat relational    Summarized and multidimensional
Usage               Structured, repetitive       Ad hoc
Unit of work        Short, simple transaction    Complex query
Access              Read/write                   Read mostly
Operations          Index/hash on primary key    Lots of scans
# Records accessed  Tens                         Millions
# Users             Thousands                    Hundreds
DB size             100 MB-GB                    100 GB-TB
Metric              Transaction throughput       Query throughput, response
Data Warehousing Characteristics
 Subject oriented
 Integrated
 Time variant (time series)
 Nonvolatile
 Web based
 Relational/multidimensional
 Client/server
 Real-time
 Include metadata
Data Warehouse Architecture
• Data warehouse architecture models
  • Inmon (Bill Inmon) model: EDW approach
  • Kimball (Ralph Kimball) model: data mart approach
• Which model is best?
  • There is no one-size-fits-all strategy for data warehousing
  • One alternative is the hosted warehouse
[Figure slides: Alternative DW Architectures; Teradata’s DW Architecture; DW Architecture]
Categories/Types of Data Warehouses
• Data Mart
  • A departmental data warehouse that stores only relevant data
• Dependent Data Mart
  • A subset that is created directly from a data warehouse
• Independent Data Mart
  • A small data warehouse designed for a strategic business unit or a department
• Operational Data Stores
  • A type of database often used as an interim area for a data warehouse, especially for customer information files
• Operational Data Marts
  • A small-scale data mart typically used by a single department or functional area in an organization
Various OLAPs
 ROLAP: Relational OLAP. Analysis cube supported
by a Multi-Dimensional Data Warehouse (or Data
Mart) based on Relational database.
 MOLAP: Multidimensional OLAP. Analysis cube
supported by a multi-dimensional data warehouse
based on another multi-dimensional data storage.
 HOLAP: Hybrid OLAP. Combination of the above
two to optimize performance.
 Your task: Search the web to find the relative
advantages and disadvantages of each.
Data Warehousing Architectures
• Issues to consider when deciding which architecture to use:
  • Which database management system (DBMS) should be used?
  • Will parallel processing and/or partitioning be used?
  • Will data migration tools be used to load the data warehouse?
  • What tools will be used to support data retrieval and analysis?
Data Warehouse Development and Implementation
Data Warehouse and BI Life Cycle
[Figure 1 labels: Structured data; Unstructured data; OLTP; Data Warehouse (Enterprise); Business Intelligence; Reporting, Ad hoc Query, Analysis; Phase I]
[Figure 2 labels: Source Data; Extract-Transform-Load; Data Integration; ODS; Data Warehouse (Enterprise); Reporting Tools; Develop, Discover, Deliver; Phase II]
[Figure 3 labels: Data Warehouse (Enterprise); Data Marts (Departments); Multi-Dimensional Cubes (Subject wise); Business Intelligence; Develop, Discover, Deliver]
Data Warehouse OLAP Cube
[Figure: OLAP cube; the highlighted cell is Total Sales for Customer 3, Product C, for July.]
From: http://oraclezine.blogspot.com/2009/01/data-warehousingand-olap-cube.html, accessed January 20, 2011
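A cube cell such as "Total Sales for Customer 3, Product C, for July" can be thought of as one aggregated value addressed by one member from each dimension. The following is a minimal sketch in Python, not the way any particular OLAP server stores cubes; the sales rows and the helper name `build_cube` are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

# Hypothetical fact rows: (customer, product, month, sales amount).
rows = [
    ("Customer 3", "Product C", "July", 120.0),
    ("Customer 3", "Product C", "July", 80.0),
    ("Customer 3", "Product A", "July", 50.0),
    ("Customer 1", "Product C", "June", 70.0),
]

def build_cube(fact_rows):
    """Aggregate sales into cells keyed by (customer, product, month)."""
    cube = defaultdict(float)
    for customer, product, month, amount in fact_rows:
        cube[(customer, product, month)] += amount
    return cube

cube = build_cube(rows)

# One cell: total sales for Customer 3, Product C, for July.
cell = cube[("Customer 3", "Product C", "July")]

# A roll-up over the product dimension: Customer 3 in July, all products.
rollup = sum(value for (customer, product, month), value in cube.items()
             if customer == "Customer 3" and month == "July")
```

Fixing one member per dimension selects a single cell; leaving a dimension free and summing over it is a roll-up.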
Data Integration and the Extraction, Transformation, and Load (ETL) Process
• Extraction
  • Reading data from a database
• Transformation
  • Converting the extracted data from its previous form into the form it needs to be in so that it can be placed into a data warehouse or simply another database
• Load
  • Putting the data into the data warehouse
Data Warehouse Schema
• Dimensional Modeling
• The Star Schema
  • Dimension tables that contain the dimensions for analysis
    • Example: time, region, salesperson, etc.
  • Fact tables that contain the measures and aggregates
    • Example: average sales, total commission, total sales, etc.
• The Snowflake Schema
  • Very similar to the star schema, with a central fact table, but the dimensions are organized hierarchically
    • Example: a listing agent is part of a listing company, one city can have multiple zip codes, etc.
  • Reduces redundant data but can be inefficient for queries that do not follow patterns
Steps for DW OLAP Design
• Decide on your information needs
  • Example: I want to know the total sales by month, by region, and by salesperson
• Decide the sources of data for each information need
  • Total sales is not available; it must be calculated from unit price and quantity sold
  • Most of the unit price and quantity data are in OLTP, but some are in CSV files for stores not connected to the database
  • The monthly information can be derived from the date, which is in OLTP
  • The region information is also in the OLTP transaction records, via a region code
  • The salesperson information is maintained in an Excel file and is coded by region
• Declare the grain of the fact table (preferably at the most atomic level)
  • The grain is the lowest level of information you are interested in. The finer the grain (tending towards atomic), the more dimensions there are.
  • An atomic grain is preferred because it allows easy roll-ups. However, it can take a lot of space and processing.
  • Often the grain of a fact table is much coarser, with the ability to drill down
  • Examples of grains
    • Sales of each customer, for each day, for each store
    • Monthly total sales for a particular region for all female customers
• Add dimensions for "everything you know" about this grain
• Add numeric measured facts true to the grain
Creating a Star* Schema
 Identify the dimensions (typically this is the analysis by). In
our real estate listing example, it can be the city name,
bedrooms, listing agent, etc.
 Identify the measures. In our case, it can be the average
price/sq ft, total number of houses in a city, the average price,
etc.
 Identify the attributes of each dimension. Attributes are the
properties of a dimension. For example, if the listing agent is
a dimension, then the first name, last name, phone number
etc, of a listing agent will be the attributes.
* Snowflake schema design is very similar, except that the hierarchy of each dimension is separated into another table.
Star Schema continued.
• Create the dimension tables with a surrogate primary key (PK). Include all the necessary attributes.
• Decide on the measures and calculations. Those will be in the fact table.
• For each PK in the dimension tables, create a foreign key (FK) in the fact table at the center of the star.
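The steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not a production design: the table and column names (`time_dim`, `region_dim`, `salesperson_dim`, `sales_fact`) and the sample rows are hypothetical, chosen to match the "total sales by month, by region, and by salesperson" example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,          -- surrogate key
    month    TEXT NOT NULL
);
CREATE TABLE region_dim (
    region_key  INTEGER PRIMARY KEY,       -- surrogate key
    region_code TEXT NOT NULL,             -- application key from the source
    region_name TEXT NOT NULL
);
CREATE TABLE salesperson_dim (
    salesperson_key INTEGER PRIMARY KEY,   -- surrogate key
    first_name      TEXT NOT NULL,
    last_name       TEXT NOT NULL
);
-- The fact table holds one FK per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key        INTEGER NOT NULL REFERENCES time_dim(time_key),
    region_key      INTEGER NOT NULL REFERENCES region_dim(region_key),
    salesperson_key INTEGER NOT NULL REFERENCES salesperson_dim(salesperson_key),
    total_sales     REAL NOT NULL
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO time_dim VALUES (1, '2000-01')")
conn.execute("INSERT INTO region_dim VALUES (1, 'E', 'East')")
conn.execute("INSERT INTO salesperson_dim VALUES (1, 'Adam', 'Barr')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 10789.0)")

# Analysis "by region": join the fact table to one dimension and aggregate.
total_by_region = conn.execute("""
    SELECT r.region_name, SUM(f.total_sales)
    FROM sales_fact AS f JOIN region_dim AS r USING (region_key)
    GROUP BY r.region_name
""").fetchall()
```

With foreign keys enabled, a fact row that points at a missing dimension key is rejected with `sqlite3.IntegrityError`, which is how the NOT NULL and FOREIGN KEY constraints protect the fact table.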
Example of Star Schema
Example of Snowflake Schema
Implementing the Star Schema
1. Extract Data From Multiple Sources
2. Integrate, Transform, and Restructure Data
3. Load Data Into Dimension Tables and Fact Tables
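The three steps above can be illustrated end to end in a few lines of Python. This is a minimal sketch, assuming two hypothetical sources (an OLTP-style list of rows and a CSV export from a store not connected to the database) and a hypothetical `sales_fact` table; a real load would use an ETL tool or staging tables.

```python
import csv
import io
import sqlite3

# 1. Extract: read rows from two hypothetical sources.
oltp_rows = [{"store": "S1", "unit_price": 2.50, "qty": 4}]
csv_text = "store,unit_price,qty\nS2,1.25,8\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))  # values arrive as strings

# 2. Transform: cast to consistent types and derive total_sales.
def transform(row):
    price = float(row["unit_price"])
    qty = int(row["qty"])
    return (row["store"], qty, round(price * qty, 2))

staged = [transform(r) for r in oltp_rows + csv_rows]

# 3. Load: insert the integrated rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (store TEXT, qty INTEGER, total_sales REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", staged)
conn.commit()

loaded = conn.execute(
    "SELECT store, qty, total_sales FROM sales_fact ORDER BY store"
).fetchall()
```

Keeping the transform in one function means both sources pass through the same rules before loading.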
The Star Schema Data Load
[Figure: DTS moves data from heterogeneous data sources (Northwind OLTP, financial, internal files, external files) into the data warehouse staging area and on to the Sales and Inventory stars. Stages shown: extracting data from heterogeneous sources, transforming data, loading the star schema.]
Verifying the Dimension Source Data
• Verifying Accuracy of Source Data
  • Integrating data from multiple sources
  • Applying business rules
  • Checking structural requirements
• Correcting Invalid Data
• Managing Invalid Data
  • Transforming data
  • Rejecting invalid data
  • Reassigning data values
  • Saving invalid data to a log
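The options for managing invalid data listed above (transforming, rejecting, reassigning, logging) can be sketched as a single cleaning pass. This is only an illustration: the function name `clean_rows`, the specific rules, and the `UNKNOWN_REGION` value are assumptions, not rules stated in the slides.

```python
UNKNOWN_REGION = 0  # hypothetical "unknown" member of the region dimension

def clean_rows(rows, valid_regions):
    """Return (accepted rows, reject log) after applying simple rules."""
    accepted, rejected = [], []
    for row in rows:
        row = dict(row)  # work on a copy so the source data is untouched
        # Transform: trim stray whitespace from the name.
        row["buyer_name"] = (row.get("buyer_name") or "").strip()
        # Reject and log: a dimension row without a name cannot be loaded.
        if not row["buyer_name"]:
            rejected.append((row, "missing buyer_name"))
            continue
        # Reassign: map an unrecognized region to the "unknown" member.
        if row.get("reg_id") not in valid_regions:
            row["reg_id"] = UNKNOWN_REGION
        accepted.append(row)
    return accepted, rejected

source = [
    {"buyer_name": " Barr, Adam ", "reg_id": 2},
    {"buyer_name": "", "reg_id": 4},
    {"buyer_name": "Chai, Sean", "reg_id": 99},
]
accepted, rejected = clean_rows(source, valid_regions={2, 4, 6})
```

Returning the reject log alongside the accepted rows keeps the invalid data available for later correction instead of silently dropping it.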
Dimension Data Load Examples:

Example 1: splitting a combined name column
Source                          After DTS
buyer_name       reg_id         buyer_first  buyer_last  reg_id
Barr, Adam       2              Adam         Barr        2
Chai, Sean       4              Sean         Chai        4
O’Melia, Erin    6              Erin         O’Melia     6
...              ...            ...          ...         ...

Example 2: assigning a value to a missing code
Source                              After DTS
buyer_code  buyer_last  reg_id      buyer_code  buyer_last  reg_id
            Barr        2           U999        Barr        2
A123        Chai        4           A123        Chai        4
B456        O’Melia     6           B456        O’Melia     6
...         ...         ...         ...         ...         ...

Example 3: integrating two sources with different region codes
Source 1                  Source 2                  After DTS
buyer_name    reg_id      buyer_name    reg_id      buyer_name    reg_id
Barr, Adam    II          Smith, Jane   2           Barr, Adam    2
Chai, Sean    IV          Paper, Anne   4           Chai, Sean    4
                                                    Smith, Jane   2
                                                    Paper, Anne   4
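The transformations in the examples above (splitting a combined name, filling a missing buyer code, and unifying region codes from two sources) can be sketched as small Python functions. The function names and the Roman-numeral mapping are assumptions made for illustration; the slides perform these steps with DTS, and `U999` is the placeholder code shown in the example.

```python
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}  # assumed mapping

def split_name(buyer_name):
    """Split 'Last, First' into (first, last)."""
    last, first = [part.strip() for part in buyer_name.split(",", 1)]
    return first, last

def fill_code(buyer_code, placeholder="U999"):
    """Replace a missing buyer code with a placeholder value."""
    return buyer_code if buyer_code else placeholder

def normalize_region(reg_id):
    """Convert a region id given as an int, digit string, or Roman numeral."""
    if isinstance(reg_id, int):
        return reg_id
    text = str(reg_id).strip()
    return ROMAN[text] if text in ROMAN else int(text)

# Integrating two sources that code the region differently.
source_1 = [("Barr, Adam", "II"), ("Chai, Sean", "IV")]
source_2 = [("Smith, Jane", 2), ("Paper, Anne", 4)]
merged = [(name, normalize_region(reg)) for name, reg in source_1 + source_2]
```

Each function handles one rule, so the rules can be tested separately and combined per source.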
Maintaining Integrity of the Dimension
• Assigning a Surrogate Key to Each Record
  • Defines the dimension’s primary key
  • Relates to the foreign key fields of the fact table
• Loading One Record Per Application Key
  • Maintains uniqueness in the dimension
  • Depends on how you manage changing dimension data
  • Maintains integrity of the fact table
Managing Changing Dimension Data
• One of the problems in a DW is dealing with data that changes between reloads and refreshes. OLTP is operational and thus does not face this problem. Dimensions affected by such changes are often called Slowly Changing Dimensions (SCD).
• Dimensions with Changing Column Values
  • Inserts of new data
  • Updates of existing data
• Slowly Changing Dimension Design Solutions
  • Type 1: Overwrite the dimension record
  • Type 2: Write another dimension record
  • Type 3: Add attributes to the dimension record
Examining the Star Schema
[Figure: Sales star schema, with a fact table at the center surrounded by dimension tables]
Type 1: Overwriting the Dimension Record
Product Dimension
Column            Before        After
product key       001           001
product name      Rice Puffs    Rice Puffs
product size      10 oz.        12 oz.
product package   Bag           Bag
product dept      Grocery       Grocery
product cat       Dry Goods     Dry Goods
product subcat    Snacks        Snacks
...               ...           ...
Existing record is changed
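A Type 1 change can be sketched as an in-place update. This is a minimal illustration using a Python dict in place of the dimension table; the function name `scd_type1_update` is hypothetical.

```python
# Dimension rows keyed by product key (values from the Rice Puffs example).
product_dim = {
    "001": {"product_name": "Rice Puffs", "product_size": "10 oz.",
            "product_package": "Bag"},
}

def scd_type1_update(dim, product_key, **changes):
    """Type 1: overwrite the changed attributes; no history is kept."""
    dim[product_key].update(changes)

scd_type1_update(product_dim, "001", product_size="12 oz.")
```

After the update there is still exactly one row for the product, and the old size can no longer be recovered from the dimension.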
Type 2: Writing Another Dimension Record
Product Dimension
Column            Before        After (existing record)   After (new record)
product key       001           001                       731
product name      Rice Puffs    Rice Puffs                Rice Puffs
product size      10 oz.        10 oz.                    12 oz.
product package   Bag           Bag                       Bag
product dept      Grocery       Grocery                   Grocery
product cat       Dry Goods     Dry Goods                 Dry Goods
product subcat    Snacks        Snacks                    Snacks
effective_date    05-01-1995    05-01-1995                10-15-1998
...               ...           ...                       ...
Adds a new record
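A Type 2 change keeps the existing record and appends a new one under a new surrogate key. The sketch below uses a Python list of dicts in place of the dimension table; the function name `scd_type2_insert` is hypothetical, and how the new key (731 in the example) is generated is left to the caller.

```python
product_dim = [
    {"product_key": "001", "product_name": "Rice Puffs",
     "product_size": "10 oz.", "effective_date": "05-01-1995"},
]

def scd_type2_insert(dim_rows, old_key, new_key, effective_date, **changes):
    """Type 2: copy the current row, apply the changes, store it under a new key."""
    current = next(r for r in dim_rows if r["product_key"] == old_key)
    new_row = {**current, **changes,
               "product_key": new_key, "effective_date": effective_date}
    dim_rows.append(new_row)
    return new_row

new_row = scd_type2_insert(product_dim, "001", "731", "10-15-1998",
                           product_size="12 oz.")
```

Fact rows loaded before the change keep pointing at key 001, while later facts use key 731, so history is preserved.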
Type 3: Adding Attributes in the Dimension Record
Product Dimension
Column                            Before        After
product key                       001           001
product name                      Rice Puffs    Rice Puffs
product size                      10 oz.        12 oz.
product package                   Bag           Bag
product dept                      Grocery       Grocery
product cat                       Dry Goods     Dry Goods
product subcat                    Snacks        Snacks
current product size date         05-01-1995    10-15-1998
previous product size             11 oz.        10 oz.
previous product size date        03-20-1994    05-01-1995
2nd previous product size         (null)        11 oz.
2nd previous product size date    (null)        03-20-1994
...                               ...           ...
Additional information is stored in an existing record
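A Type 3 change shifts the current value into "previous" columns on the same record. The sketch below is a minimal illustration with a Python dict; the shortened column names and the function name `scd_type3_update` are assumptions made for readability.

```python
row = {
    "product_key": "001", "product_name": "Rice Puffs",
    "product_size": "10 oz.", "current_size_date": "05-01-1995",
    "previous_size": "11 oz.", "previous_size_date": "03-20-1994",
    "second_previous_size": None, "second_previous_size_date": None,
}

def scd_type3_update(row, new_size, new_date):
    """Type 3: shift values into the history columns, then store the new value."""
    # Shift oldest first so nothing is overwritten before it is copied.
    row["second_previous_size"] = row["previous_size"]
    row["second_previous_size_date"] = row["previous_size_date"]
    row["previous_size"] = row["product_size"]
    row["previous_size_date"] = row["current_size_date"]
    row["product_size"] = new_size
    row["current_size_date"] = new_date

scd_type3_update(row, "12 oz.", "10-15-1998")
```

The amount of history kept is limited by the number of "previous" columns; one more change would push the 11 oz. value out of the record.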
Verifying the Fact Table Source Data
• Verifying Accuracy of Source Data
  • Integrating data from multiple sources
  • Applying business rules
  • Checking structural requirements
  • Creating calculated fields
• Correcting Invalid Data
• Managing Invalid Data
  • Transforming data
  • Rejecting invalid data
  • Reassigning data values
  • Saving invalid data to a log
Assigning Foreign Keys
Dimension Tables
customer_dim:  201  ALFI  Alfreds
product_dim:   25   123   Chai
time_dim:      134  1/1/2000

Source Data
customer id  product id  order date  quantity_sales  amount_sales
ALFI         123         1/1/2000    400             10,789

Sales Fact Data
cust_key  prod_key  time_key  quantity_sales  amount_sales
201       25        134       400             10,789
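The key assignment in the example above amounts to replacing each application key in the source row with the surrogate key of the matching dimension record. A minimal sketch, assuming the dimension lookups are held in Python dicts; the function name `to_fact_row` is hypothetical.

```python
# Application key -> surrogate key, taken from the dimension tables above.
customer_dim = {"ALFI": 201}
product_dim = {"123": 25}
time_dim = {"1/1/2000": 134}

def to_fact_row(source_row):
    """Swap application keys for surrogate keys; measures pass through unchanged."""
    return {
        "cust_key": customer_dim[source_row["customer_id"]],
        "prod_key": product_dim[source_row["product_id"]],
        "time_key": time_dim[source_row["order_date"]],
        "quantity_sales": source_row["quantity_sales"],
        "amount_sales": source_row["amount_sales"],
    }

source_row = {"customer_id": "ALFI", "product_id": "123",
              "order_date": "1/1/2000",
              "quantity_sales": 400, "amount_sales": 10789}
fact_row = to_fact_row(source_row)
```

A source row whose key is missing from a dimension raises `KeyError` here, which is one simple way to stop facts from being loaded before their dimension records exist.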
Defining Measures
• Loading Measures from the Source System
• Calculating Additional Measures

Source System Data
customer_id  product_id  price  qty
VINET        9GZ         .55    32
ALFI         1KJ         1.10   48
HANAR        0ZA         .98    9
...          ...         ...    ...

Fact Table Data
customer_key  product_key  qty  total_sales
100           512          32   17.60
238           207          48   52.80
437           338          9    8.82
...           ...          ...  ...
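In the example above, `qty` is loaded as-is from the source system and `total_sales` is a calculated measure, price times quantity. The sketch below reproduces that calculation with `decimal.Decimal` so currency values stay exact; the key-lookup dicts simply restate the keys shown in the example, and the variable names are hypothetical.

```python
from decimal import Decimal

# Source system rows: (customer_id, product_id, price, qty).
source_rows = [
    ("VINET", "9GZ", Decimal("0.55"), 32),
    ("ALFI",  "1KJ", Decimal("1.10"), 48),
    ("HANAR", "0ZA", Decimal("0.98"), 9),
]

# Application key -> surrogate key, as shown in the fact table data.
customer_keys = {"VINET": 100, "ALFI": 238, "HANAR": 437}
product_keys = {"9GZ": 512, "1KJ": 207, "0ZA": 338}

# Fact rows: (customer_key, product_key, qty, total_sales).
fact_rows = [
    (customer_keys[cust], product_keys[prod], qty, price * qty)
    for cust, prod, price, qty in source_rows
]
```

The calculated totals are 17.60, 52.80, and 8.82, matching the fact table data in the example.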
Maintaining Data Integrity
 Adhering to the Fact Table Grain
 A fact table can only have one grain
 You must load a fact table with data at the same level of detail
as defined by the grain
 Enforcing Column Constraints
 NOT NULL constraints
 FOREIGN KEY constraints
Implementing Staging Tables
• Centralize and Integrate Source Data
• Break Up Complex Data Transformations
• Facilitate Error Recovery
[Figure: staging area containing market_stage, shipments_stage, sales_stage, and inventory_stage]
DTS Functionality
 Accessing Heterogeneous Data Sources
 Importing, Exporting, and Transforming Data
 Creating Reusable Transformations and Functions
 Automating Data Loads
 Managing Metadata
 Customizing and Extending Functionality
Defining DTS Packages
 Identifies Data Sources and Destinations
 Defines Tasks or Actions
 Implements Transformation Logic
 Defines Order of Operations
Identifying Package Components
 Connections: Access Data Sources and Destinations
 Tasks: Describe Data Transformations or Functions
 Steps: Define the Order of Task Operations or
Workflow
 Global Variables: Store Data that Can Be Shared
Across Tasks
Creating Packages
 Using the DTS Import / Export Wizard
 Perform ad-hoc table and data transfers
 Develop a prototype package
 Using DTS Package Designer
 Edit packages created with the DTS Import/Export Wizard
 Create packages with a wide range of functionality
 Programming DTS Applications
 Directly access the functionality of the DTS Object Model
 Requires Microsoft Visual Basic or Microsoft Visual C++
Using DTS to Populate the Sales Star
 Populating the Sales Star Dimensions
 Populating the Sales Star Fact Table
Populating the Sales Star Dimensions
[Figure: DTS loads the dimension tables product_dim, customer_dim, and time_dim. Sources shown: product tab-delimited files, the Northwind OLTP database, and a SQL Server stored procedure.]
Populating the Sales Star Fact Table
[Figure: DTS loads a sales data file into sales_stage; DTS then uses sales_stage together with product_dim, customer_dim, and time_dim to load sales_fact.]
Why not Just Excel Pivot Tables?
• For small projects, pivot tables are excellent
• However, if you insert or delete records, you have to refresh the pivot table and may have to change the table reference (though the Table feature makes this easy)
• A calculation cannot be defined once for a field and reflected in all records; you have to copy it to every record
• No pre-processing can be done. The pivot table is refreshed and recalculated every time, which can lead to inefficiencies
• No built-in integration mechanism with a data source (it can be done by connecting to cube servers and ODBC-compliant databases, but that is an extra step and extra work)
• Selective drill-down is difficult, mainly all or none (again, grouping can help, but you have to be skilled in pivot tables)
Data Warehouse Implementation
 Implementation issues
 Implementing a data warehouse is generally a
massive effort that must be planned and executed
according to established methods
 There are many facets to the project lifecycle, and
no single person can be an expert in each area
Major Tasks for Successful Implementation of a DW
1. Establishment of service-level agreements and data-refresh requirements
2. Identification of data sources and their governance policies
3. Data quality planning
4. Data model design
5. ETL tool selection
6. Relational database software and platform selection
7. Data transport
8. Data conversion
9. Reconciliation process
10. Purge and archive planning
11. End-user support
From Solomon, 2005
Data Warehouse Implementation
• Implementation factors can be categorized into three criteria
  • Organizational issues
  • Project issues
  • Technical issues
• User participation in the development of data and access modeling is a critical success factor in data warehouse development
Data Warehouse Management
• Data warehouse administrator (DWA)
  • A person responsible for the administration and management of a data warehouse
• Security Concerns
  • “All eggs in one basket”
  • Unauthorized access
• Usage Concerns
  • Users need to understand how to use the data correctly
  • Why was the data collected, how was it stored, and how do these affect the current use of the data?
Data Warehouse Management
• Effective security in a data warehouse should focus on four main areas:
  • Establishing effective corporate and security policies and procedures
  • Implementing logical security procedures and techniques to restrict access
  • Limiting physical access to the data center environment
  • Establishing an effective internal control review process with an emphasis on security and privacy
Data Warehouse Benefits
 Direct benefits of a data warehouse
 Allows end users to perform extensive analysis
 Allows a consolidated view of corporate data
 Better and more timely information
 Enhanced system performance
 Simplification of data access
Data Warehouse Benefits
• Indirect benefits result from end users using these direct benefits
  • Enhance business knowledge
  • Present competitive advantage
  • Enhance customer service and satisfaction
  • Facilitate decision making
  • Help in reforming business processes
Data Warehouse Vendor Selection
 Important guidelines for selecting a DW vendor
 Financial strength
 ERP linkages
 Qualified consultants
 Market share
 Industry experience
 Established partnerships
DW Development Best Practices
• Project must fit with corporate strategy and business objectives
• There must be complete buy-in to the project by executives, managers, and users
• It is important to manage user expectations about the completed project
• The data warehouse must be built incrementally
• Build in adaptability
• The project must be managed by both IT and business professionals
• Develop a business/supplier relationship
• Only load data that have been cleansed and are of a quality understood by the organization
• Do not overlook training requirements
• Be politically aware
From Weir, 2002
DW Development: What NOT to do
• Cultural issues being ignored
• Inappropriate architecture
• Unclear business objectives
• Missing information
• Unrealistic expectations
• Low levels of data summarization
• Low data quality
• Believing promises of performance, capacity, and scalability
• Believing that your problems are over when the data warehouse is up and running
• Believing that DW database design is the same as transactional database design
• Choosing a DW manager who is technology oriented rather than user oriented
• Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video
• Focusing on ad hoc data mining and periodic reporting instead of alerts
• Delivering data with overlapping and confusing definitions