Introduction to Databases - Department of Software and Information

Data Warehousing
Dale-Marie Wilson, Ph.D.
Evolution of Data
Warehousing

Since the 1970s, organizations have gained
competitive advantage through:
Automated business processes
More efficient and cost-effective services to
customers


Resulted in accumulation of growing
amounts of data in operational databases
Evolution of Data
Warehousing

Increased focus on ways to use operational data to support decision-making


Means of gaining competitive advantage
Operational systems not designed to support such business
activities
Typically numerous operational systems with overlapping and
contradictory definitions

Organizations need to turn archives of data into source of knowledge


Goal: single integrated / consolidated view of organization’s data
presented to user
Solution: Data Warehouse

Provides system capable of supporting decision-making, receiving
data from multiple operational data sources
Data Warehousing Concepts
A subject-oriented, integrated, time-variant,
and non-volatile collection of data in
support of management’s decision-making
process (Inmon, 1993)
Subject-oriented Data

Warehouse organized around major
subjects of the enterprise e.g. customers,
products, and sales


Not major application areas (e.g. customer
invoicing, stock control, and product sales)
Stores decision-support data not
application-oriented data
Integrated Data

Integrates corporate application-oriented
data from different source systems


Includes inconsistent data
Integrated data source made consistent

Presents unified view of data to users
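
To make the idea of integration concrete, here is a minimal sketch, assuming two hypothetical source systems that encode the same attribute differently. The field names, codes, and the `integrate` helper are illustrative assumptions, not taken from any particular system.

```python
# Hypothetical extracts from two source systems that encode the same
# attribute in different ways (field names and codes are assumptions).
sales_system = [{"cust_id": 1, "gender": "M"}, {"cust_id": 2, "gender": "F"}]
billing_system = [{"customer": 3, "sex": "male"}, {"customer": 4, "sex": "female"}]

# One agreed encoding for the warehouse.
GENDER_MAP = {"M": "male", "F": "female", "male": "male", "female": "female"}


def integrate(sales_rows, billing_rows):
    """Map both sources onto one consistent set of names and codes."""
    unified = []
    for r in sales_rows:
        unified.append({"customer_id": r["cust_id"], "gender": GENDER_MAP[r["gender"]]})
    for r in billing_rows:
        unified.append({"customer_id": r["customer"], "gender": GENDER_MAP[r["sex"]]})
    return unified


customers = integrate(sales_system, billing_system)
print(customers)
```

After integration, every row uses the same field names and the same code values, which is what lets the warehouse present a unified view.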
Time-variant Data

Data accurate and valid only at some point in time
or over some time interval

Time-variance shown in:
Extended time that data is held
Implicit/explicit association of time with data
Data represents series of snapshots

Non-volatile Data

Data not updated in real time

Refreshed from operational systems on
regular basis

New data added as supplement not
replacement
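
The "supplement, not replacement" rule can be sketched as an append-only load. This is only an illustration under assumed names (`warehouse_rows`, `refresh`) and made-up values; it also shows the time-variant idea of a series of dated snapshots.

```python
from datetime import date

# Hypothetical warehouse table: each refresh appends dated snapshot rows.
warehouse_rows = []


def refresh(snapshot_date, operational_rows):
    """Append the current operational state as a new snapshot; never overwrite."""
    for row in operational_rows:
        warehouse_rows.append({"snapshot_date": snapshot_date, **row})


# The operational system holds only the current value for each branch.
refresh(date(2004, 1, 31), [{"branch": "B003", "staff_count": 5}])
refresh(date(2004, 2, 29), [{"branch": "B003", "staff_count": 7}])

# Both snapshots remain available, so history is preserved.
history = [(r["snapshot_date"], r["staff_count"]) for r in warehouse_rows]
print(history)
```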
Data Webhouse

Web is source of behavioral data


Clickstream – user’s path through a Web site and
Web history
Data webhouse is a distributed data
warehouse with no central data repository
that is implemented over the Web to
harness clickstream data
Benefits of Data Warehouse

Potential high returns on investment

Competitive advantage

Increased productivity of corporate
decision-makers
Comparison of OLTP
Systems and Data
Warehousing
Data Warehouse Queries

Queries



Range from relatively simple to highly complex
Dependent on end-user access tools used
End-user access tools:




Reporting, query, and application development
tools
Executive information systems (EIS)
OLAP tools
Data mining tools
Examples of Typical Data
Warehouse Queries







What was the total revenue for Scotland in the third quarter of 2004?
What was the total revenue for property sales for each type of property in
Great Britain in 2003?
What are the three most popular areas in each city for the renting of
property in 2004 and how does this compare with the figures for the
previous two years?
What is the monthly revenue for property sales at each branch office,
compared with rolling 12-monthly prior figures?
What would be the effect on property sales in the different regions of
Britain if legal costs went up by 3.5% and Government taxes went down by
1.5% for properties over £100,000?
Which type of property sells for prices above the average selling price for
properties in the main cities of Great Britain and how does this correlate to
demographic data?
What is the relationship between the total annual revenue generated by
each branch office and the total number of sales staff assigned to each
branch office?
Problems of Data
Warehousing

Underestimation of resources for data loading

Hidden problems with source systems

Required data not captured

Increased end-user demands

Data homogenization

High demand for resources

Data ownership

High maintenance

Long duration projects

Complexity of integration
Typical Architecture of
Data Warehouse
Operational Data Resources





Mainframe first generation hierarchical and
network databases
Departmental proprietary file systems (e.g. VSAM,
RMS)
Relational DBMSs (e.g. Informix, Oracle)
Private workstations and servers
External systems



Internet
Commercially available databases
Databases associated with organization’s
suppliers or customers
Operational Data Store
(ODS)





Repository of current and integrated operational data used
for analysis
Structured and supplied with data like data warehouse
May act as staging area for data to be moved into
warehouse
Created when legacy operational systems incapable of
achieving reporting requirements
Benefits:
Provides users with ease of use of a relational database
Remains distant from decision-support functions of data warehouse
Load Manager
Performs operations associated with
extraction and loading of data
Size and complexity vary between data
warehouses
Constructed using combination of vendor
data loading tools and custom-built
programs

Warehouse Manager



Performs operations associated with management of data
Constructed using vendor data management tools and
custom-built programs
Operations:
Data analysis to ensure consistency
Transformation and merging of source data from temporary
storage
Creation of indexes and views on base tables
Generation of denormalizations (if necessary)
Generation of aggregations (if necessary)
Backing-up and archiving data
Warehouse Manager

Generates query profiles to determine which
indexes and aggregations are appropriate

Query profile


Can be generated for each user, group of users,
or the data warehouse
Describes characteristics of queries
• Frequency
• Target table(s)
• Size of results set
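
A query profile could be derived from a query log along these lines. This is a rough sketch: the `QueryRecord` fields, the log entries, and the `build_profile` helper are hypothetical, chosen to match the three characteristics listed above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QueryRecord:
    user: str
    target_table: str
    result_rows: int


# Hypothetical query log entries.
log = [
    QueryRecord("ann", "property_sale", 120),
    QueryRecord("ann", "property_sale", 80),
    QueryRecord("bob", "branch", 10),
]


def build_profile(records):
    """Summarize frequency and average result size per target table."""
    freq = Counter(r.target_table for r in records)
    total_rows = Counter()
    for r in records:
        total_rows[r.target_table] += r.result_rows
    return {
        table: {"frequency": n, "avg_result_rows": total_rows[table] / n}
        for table, n in freq.items()
    }


profile = build_profile(log)
print(profile)
```

Filtering `log` by user or group before calling `build_profile` would give a per-user or per-group profile instead of one for the whole warehouse.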
Query Manager

Performs operations associated with management of user queries

Constructed using vendor end-user data access tools, data warehouse
monitoring tools, database facilities, and custom-built programs

Complexity determined by facilities provided by end-user access tools
and database

Operations:



Directing queries to appropriate tables
Scheduling execution of queries
Can generate query profiles

Allows warehouse manager to determine appropriate indexes and
aggregations
Detailed Data

Stores all detailed data in database schema
In most cases, detailed data not stored online
but aggregated to next level of detail


Detailed data regularly added to warehouse to
supplement aggregated data
Lightly and Highly
Summarized Data

Stores pre-defined lightly and highly aggregated data
generated by warehouse manager

Transient - changes to respond to changing query profiles

Purpose of summary information:
Improve query performance

Removes requirement to continually perform summary
operations in answering user queries

Summary data updated continuously as new data loaded
into warehouse
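
As a small illustration of lightly versus highly summarized data, the sketch below pre-aggregates some made-up detailed rows at two levels. The row layout (branch, month, revenue) and the `summarize` helper are assumptions for the example only.

```python
from collections import defaultdict

# Hypothetical detailed rows: (branch, month, revenue).
detailed = [
    ("B003", "2004-01", 100.0),
    ("B003", "2004-01", 50.0),
    ("B003", "2004-02", 70.0),
    ("B005", "2004-01", 30.0),
]


def summarize(rows, key):
    """Total the revenue column for each value of key(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# Lightly summarized: one total per branch and month.
lightly = summarize(detailed, key=lambda r: (r[0], r[1]))
# Highly summarized: one total per year across all branches.
highly = summarize(detailed, key=lambda r: r[1][:4])

print(lightly)
print(highly)
```

A query for yearly revenue can then read the one-row `highly` summary instead of scanning every detailed row.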
Archive/Backup Data

Stores detailed and summarized data for
archiving and backup

Data transferred to storage archives such as magnetic tape or optical disk
Metadata


Stores metadata (data about data) definitions
used by all processes in warehouse
Used for:

Extraction and loading processes
• Used to map data sources to common view of
information within warehouse

Warehouse management process
• Used to automate production of summary tables

Query management process
• Used to direct query to most appropriate data source
Metadata

Metadata structure differs between processes



Different purposes
Issues:

Multiple copies of metadata describe same data item

Vendor tools and end-user data access use own versions of
metadata

Copy management tools use metadata to understand mapping
rules that are applied to convert source data into common form

End-user access tools use metadata to understand how to build a
query
Management of metadata within the data warehouse is a very
complex task that should not be underestimated
End-User Access Tools

Principal purpose of data warehousing:

To provide information to business users for strategic decision-making

Users interact with warehouse using end-user access tools

Data warehouse must efficiently support ad hoc and routine analysis

High performance achieved by pre-planning requirements for:
Joins
Summations
Periodic reports by end-users (where possible)
Main groups of access tools





Data reporting and query tools
Application development tools
Executive information system (EIS) tools
Online analytical processing (OLAP) tools
Data mining tools
Data Warehouse
Information Flows

Inflow - Processes associated with extraction, cleansing,
and loading data from source systems

Upflow - Processes associated with adding value to data in
warehouse through summarizing, packaging, and
distribution

Downflow - Processes associated with archiving and
backing-up/recovery of data

Outflow - Processes associated with making data available
to end-users

Metaflow - Processes associated with management of
metadata
Data Warehousing Tools
and Technologies

Building data warehouse is complex task

No vendor provides an ‘end-to-end’
set of tools

Necessitates data warehouse built using
multiple products from different vendors

Major challenge:

Ensuring products work well together and
are fully integrated
Data Warehousing Tools
and Technologies

Tasks of capturing data from source systems,
cleansing and transforming it, and loading
results into target system can be carried out
either by separate products, or by a single
integrated solution

Integrated solutions include



Code Generators
Database Data Replication Tools
Dynamic Transformation Engines
Data Warehouse DBMS
Requirements










Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality
Administration and
Management Tools
Monitoring data loading from multiple
sources
Data quality and integrity checks
Managing and updating metadata
Monitoring database performance to
ensure efficient query response times
and resource utilization
Auditing data warehouse usage to
provide user chargeback information

Administration and
Management Tools
Replicating, subsetting, and distributing
data
Maintaining efficient data storage
management
Purging data
Archiving and backing-up data
Implementing recovery following failure
Security management

Typical Data Warehouse
and Data Mart Architecture
Data Mart

A subset of a data warehouse that
supports the requirements of a particular
department or business function

Characteristics:
Focuses on requirements of one
department or business function
Does not normally contain detailed
operational data, unlike data warehouses
More easily understood and navigated

Reasons for Creating a
Data Mart

Give users access to data they need to analyze most often

Provide data in form that matches collective view of data by group of
users in a department or business function area

Improve end-user response time

Reduction in volume of data to be accessed

Provide appropriately structured data as dictated by requirements of end-user access tools

Building data mart is simpler compared with establishing corporate data
warehouse

Cost of implementing data marts less than that required to establish data
warehouse

Potential users of data mart more clearly defined

More easily targeted to obtain support for data mart project
Designing Data
Warehouses

Initially, need answers for questions such
as:




Which user requirements are most important and
which data should be considered first?
Should the project be scaled down into
something more manageable?
Should the infrastructure for a scaled down
project be capable of ultimately delivering a full-scale enterprise-wide data warehouse?
Designing Data
Warehouses

Use of data marts avoids complexities
associated with designing a data warehouse
Difficult to commit to enterprise-wide design that must meet all user
requirements
Interim solution => build data marts
Goal: creation of data warehouse that
supports requirements of enterprise

Designing Data
Warehouses

Requirements collection and analysis stage:
Involves interviewing appropriate members of staff (such
as marketing users, finance users, and sales users)
• Identify prioritized set of requirements data warehouse must
meet

Interviews conducted with members of staff responsible for
operational systems
• Identify which data sources can provide clean, valid, and
consistent data that will remain supported over next few years

Interviews provide necessary information for top-down view
(user requirements) and bottom-up view (available data
sources)

Database component of data warehouse described using
technique called dimensionality modeling
Dimensionality Modelling

Logical design technique that aims to present data in
standard, intuitive form that allows for high-performance
access

Uses Entity-Relationship modeling concepts with important
restrictions:
Every dimensional model (DM) composed of one table with
a composite primary key, called fact table, and set of
smaller tables called dimension tables
Each dimension table has simple (non-composite) primary
key that corresponds exactly to one component of
composite key in fact table

Forms ‘star-like’ structure called star schema or star join
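
The structure described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`time_dim`, `branch_dim`, `property_sale_fact`) and the sample rows are illustrative assumptions, not the DreamHome schema itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: each has a simple (non-composite) primary key.
CREATE TABLE time_dim   (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE branch_dim (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: composite primary key, one component per dimension.
CREATE TABLE property_sale_fact (
    time_id    INTEGER REFERENCES time_dim(time_id),
    branch_id  INTEGER REFERENCES branch_dim(branch_id),
    sale_price REAL,
    PRIMARY KEY (time_id, branch_id)
);

INSERT INTO time_dim VALUES (1, 2004, 3), (2, 2004, 4);
INSERT INTO branch_dim VALUES (10, 'Glasgow', 'Scotland'), (20, 'London', 'England');
INSERT INTO property_sale_fact VALUES (1, 10, 150000), (2, 10, 90000), (1, 20, 300000);
""")

# Star join: constrain on dimension attributes, aggregate the fact measure.
rows = conn.execute("""
    SELECT b.region, SUM(f.sale_price)
    FROM property_sale_fact f
    JOIN time_dim t   ON f.time_id = t.time_id
    JOIN branch_dim b ON f.branch_id = b.branch_id
    WHERE t.year = 2004 AND t.quarter = 3
    GROUP BY b.region
    ORDER BY b.region
""").fetchall()
print(rows)
```

Every join runs from the fact table out to one dimension table, which is what gives the schema its star shape.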
Dimensionality Modelling

Natural keys replaced with surrogate keys


Every join between fact and dimension
tables based on surrogate keys, not
natural keys
Surrogate key – generalized structure
based on integers

Allows data in warehouse independence
from data used and produced by OLTP
systems
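
One simple way to generate integer surrogate keys is sketched below. The `SurrogateKeyMap` class and the branch numbers are hypothetical; a real warehouse would more likely rely on the DBMS or the loading tool to assign these keys.

```python
from itertools import count


class SurrogateKeyMap:
    """Assign a stable integer surrogate key to each natural key."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate if this natural key was seen before.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next)
        return self._keys[natural_key]


branch_keys = SurrogateKeyMap()
k1 = branch_keys.key_for("B003")
k2 = branch_keys.key_for("B005")
k3 = branch_keys.key_for("B003")  # same natural key, same surrogate
print(k1, k2, k3)
```

Because the fact table stores only the integer, a later change to how the OLTP system formats branch numbers would not affect the joins.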
Star schema for property
sales of DreamHome
Dimensionality Modelling

Star schema - logical structure




Has fact table containing factual data in center
Surrounded by dimension tables containing
reference data, which can be denormalized
Facts generated by events that occurred in the
past and are unlikely to change, regardless of how analyzed
Dimensionality Modelling

Fact tables:
Hold bulk of data in data warehouse
Can be extremely large

Important to treat fact data as read-only
reference data that will not change over
time
Most useful fact tables contain one or
more numerical measures, or ‘facts’, that
occur for each record and are numeric
and additive

Dimensionality Modelling

Dimension tables:
Usually contain descriptive textual
information
Dimension attributes used as constraints
in data warehouse queries


Star schemas speed up query
performance by denormalizing reference
information into single dimension table
Dimensionality Modelling

Snowflake schema


Variant of the star schema where dimension
tables do not contain denormalized data
Starflake schema


Hybrid structure that contains mixture of star
(denormalized) and snowflake (normalized)
schemas
Allows dimensions to be present in both forms to
cater for different query requirements
Property sales with
normalized version of Branch
dimension table
Dimensionality Modelling

Advantages of predictable, standard form
of underlying dimensional model:
Efficiency
Ability to handle changing requirements
• Star schema handles ad hoc user queries well

Extensibility
• Supports changes e.g. adding new dimension,
facts
Ability to model common business
situations
Predictable query processing

Comparison of DM and ER
models

ER model


Reduces data redundancy
Beneficial to transaction processing

Single ER model normally decomposes into
multiple DMs

Multiple DMs are associated through ‘shared’
dimension tables
Database Design
Methodology for Data
Warehouses

‘Nine-Step Methodology’:









Choosing the process
Choosing the grain
Identifying and conforming the dimensions
Choosing the facts
Storing pre-calculations in the fact table
Rounding out the dimension tables
Choosing the duration of the database
Tracking slowly changing dimensions
Deciding the query priorities and the query
modes
Step 1: Choosing the
process

The process (function) refers to subject
matter of particular data mart

First data mart built should be:
Most likely to be delivered on time
Within budget
Able to answer the most commercially important
business questions

Business process of
DreamHome case study
Example – Chosen Data
Mart
Step 2: Choosing the grain

Decide what a record of fact table represents

Identify dimensions of fact table

Grain decision for fact table also determines
grain of each dimension table

Include time as core dimension

Always present in star schemas
Step 3: Identifying and
Conforming dimensions

Dimensions set context for asking questions
about the facts in fact table

If any dimension occurs in two data marts:



Must be exactly same dimension
Or one must be mathematical subset of other
Dimension used in more than one data mart
referred to as being conformed
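
The conformance rule above can be checked mechanically. This sketch treats each dimension as a set of (key, description) rows, which is a simplifying assumption; the `is_conformed` helper and the sample rows are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """True if the two dimensions are identical or one is a subset of the other."""
    a, b = set(dim_a), set(dim_b)
    return a <= b or b <= a


# Hypothetical branch dimension rows as (surrogate key, city) pairs.
sales_branch_dim = {(1, "Glasgow"), (2, "London"), (3, "Aberdeen")}
advert_branch_dim = {(1, "Glasgow"), (2, "London")}      # subset: conformed
other_branch_dim = {(1, "Glasgow"), (2, "Bristol")}      # conflicting row

print(is_conformed(sales_branch_dim, advert_branch_dim))
print(is_conformed(sales_branch_dim, other_branch_dim))
```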
Star schemas for property
sales and property
advertising
Step 4: Choosing the facts

Grain of fact table determines which facts can
be used in data mart

Facts should be numeric and additive

Unusable facts include:



non-numeric facts
non-additive facts
fact at different granularity from other facts in
table
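
A short worked example shows why non-additive facts are a problem. The figures are made up: an average price stored per row cannot simply be averaged again, whereas additive totals roll up correctly.

```python
# Hypothetical fact rows, one per branch (values are made up).
rows = [
    {"branch": "B003", "num_sales": 1, "total_revenue": 100.0},
    {"branch": "B005", "num_sales": 3, "total_revenue": 600.0},
]

# A non-additive fact: average price per sale at each branch.
for r in rows:
    r["avg_price"] = r["total_revenue"] / r["num_sales"]

# Additive facts can be summed across rows and then combined.
total_revenue = sum(r["total_revenue"] for r in rows)
total_sales = sum(r["num_sales"] for r in rows)
true_avg = total_revenue / total_sales

# Averaging the stored averages ignores how many sales each row covers.
naive_avg = sum(r["avg_price"] for r in rows) / len(rows)

print(true_avg, naive_avg)
```

Storing `num_sales` and `total_revenue` and deriving the average at query time avoids the discrepancy.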
Property rentals with a
badly structured fact table
Property rentals with fact
table corrected
Step 5: Storing pre-calculations in the fact table

Once facts selected

Re-examine to determine whether there
are opportunities to use pre-calculations
Step 6: Rounding out the
dimension tables
Text descriptions are added to dimension
tables
Text descriptions should be intuitive and
understandable to users
Usefulness of data mart determined by
scope and nature of attributes of
dimension tables

Step 7: Choosing the
duration of the database

Duration measures how far back in time fact
table goes

Very large fact tables raise two very significant
data warehouse design issues:


Often difficult to source increasingly old data
Mandatory that old versions of important
dimensions be used, not the most current
versions - aka ‘Slowly Changing Dimension’
problem
Step 8: Tracking slowly
changing dimensions

Slowly changing dimension problem:
Proper description of old dimension data must be used with old fact data

Generalized key assigned to important dimensions:
Allows distinction between multiple snapshots of dimensions over period of time
Three basic types of slowly changing dimensions:



Type 1 - where changed dimension attribute overwritten
Type 2 - where changed dimension attribute causes new dimension
record to be created
Type 3 - where a changed dimension attribute causes alternate attribute to
be created
• Both the old and new values of attribute simultaneously accessible in the same
dimension record
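
The three types can be contrasted on a single dimension record. This is a minimal sketch: the row layout, the `previous_` attribute naming, and the three helper functions are assumptions made for illustration.

```python
import copy


def scd_type1(dim, key, attr, new_value):
    """Type 1: overwrite the attribute; the old value is lost."""
    dim = copy.deepcopy(dim)
    for row in dim:
        if row["natural_key"] == key:
            row[attr] = new_value
    return dim


def scd_type2(dim, key, attr, new_value):
    """Type 2: add a new record with a new surrogate key; keep the old one."""
    dim = copy.deepcopy(dim)
    current = [r for r in dim if r["natural_key"] == key][-1]
    new_row = dict(current, **{attr: new_value})
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in dim) + 1
    dim.append(new_row)
    return dim


def scd_type3(dim, key, attr, new_value):
    """Type 3: keep the old value in an alternate attribute on the same record."""
    dim = copy.deepcopy(dim)
    for row in dim:
        if row["natural_key"] == key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value
    return dim


branch_dim = [{"surrogate_key": 1, "natural_key": "B003", "city": "Glasgow"}]
t1 = scd_type1(branch_dim, "B003", "city", "Edinburgh")
t2 = scd_type2(branch_dim, "B003", "city", "Edinburgh")
t3 = scd_type3(branch_dim, "B003", "city", "Edinburgh")
print(t1, t2, t3, sep="\n")
```

With Type 2, old fact rows keep pointing at surrogate key 1 (Glasgow) while new fact rows use key 2, so old facts stay paired with the old description.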
Step 9: Deciding the query
priorities and the query
modes

Most critical physical design issues affecting
end-user’s perception include:



Physical sort order of fact table on disk
Presence of pre-stored summaries or
aggregations
Additional physical design issues:




Administration
Backup
Indexing performance
Security
Database Design
Methodology for Data
Warehouses

Methodology designs data mart:



Supports requirements of particular business
process
Allows easy integration with other related data
marts to form enterprise-wide data warehouse
A dimensional model that contains more than
one fact table sharing one or more conformed
dimension tables is
referred to as a fact constellation
Fact and dimension tables
for each business process of
DreamHome
Dimensional model (fact
constellation) for the
DreamHome data warehouse

Chapters 31 & 32

Omit material specific to Oracle