Data warehouse design issues

Data Warehouse :
Design and Lifecycle
N. L. Sarda
Professor, IIT Bombay
nls@cse.iitb.ernet.in
NLS/IITB/DWH
Outline
• Introduction
• Warehouse structure
• A case study
• Lifecycle for development
• Dimensional analysis
• Technical architecture
• Conclusions
Introduction
• DW is a single, complete and consistent store of
data from different sources to understand &
analyze the business
• Contains history data
• Typical decision support requires data to be correlated and aggregated in an interactive manner
• Warehouse to facilitate browsing, navigating,
aggregating and visualization of related data to
understand performance, problems, customer
preferences, trends, etc.
Introduction...
• Conventional MIS/reporting applications lacked
interactivity and flexibility
• Warehouse data organized by important
business subjects (customer, product, etc…)
Warehouse Structure
• Organized to facilitate ease of access and
aggregation
• Warehouse structure is decomposed into
dimensions and facts
– Dimensions are like ‘independent variables’; they represent
entities for analysis
– A fact represents business data and relates to a set of
dimensions
– E.g., customer, time, and type of account are dimensions,
and balances are facts
Warehouse Structure...
• The complex network of business entities and
their relationships as depicted in an operational
DB (using, say, ER model) is difficult for
navigation and analysis
• A ‘2-level’ structure defined by a ‘star schema’ is
preferred, where a fact is at the center and
dimensions form ‘spokes’
• Data not stored in ‘normalized’ form
Star Schema
• Contains a fact table and, for each dimension, one
dimension table
[Figure: star schema with a central fact table (date, custno, prodno, cityname, ...) surrounded by the dimension tables Time, Cust, Prod, and City]
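A minimal sketch of a star schema, using Python's built-in sqlite3 module. The fact columns (date, custno, prodno, cityname) follow the slide; every other table name, column, and data value is an assumption made up for illustration.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by four dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, name TEXT, occupation TEXT);
CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE city_dim (cityname TEXT PRIMARY KEY, state TEXT);
CREATE TABLE sales_fact (
    date TEXT REFERENCES time_dim(date),
    custno INTEGER REFERENCES cust_dim(custno),
    prodno INTEGER REFERENCES prod_dim(prodno),
    cityname TEXT REFERENCES city_dim(cityname),
    amount REAL
);
INSERT INTO time_dim VALUES ('2024-01-15', '2024-01', 2024), ('2024-02-03', '2024-02', 2024);
INSERT INTO cust_dim VALUES (1, 'Asha', 'engineer'), (2, 'Ravi', 'teacher');
INSERT INTO prod_dim VALUES (10, 'Savings account', 'deposit'), (20, 'Home loan', 'loan');
INSERT INTO city_dim VALUES ('Mumbai', 'Maharashtra'), ('Pune', 'Maharashtra');
INSERT INTO sales_fact VALUES
    ('2024-01-15', 1, 10, 'Mumbai', 100.0),
    ('2024-01-15', 2, 20, 'Pune', 250.0),
    ('2024-02-03', 1, 20, 'Mumbai', 400.0);
""")

# Typical star query: join the fact to its dimensions, constrain on one
# dimension attribute, and group by others.
rows = conn.execute("""
    SELECT t.month, p.category, SUM(f.amount)
    FROM sales_fact f
    JOIN time_dim t ON f.date = t.date
    JOIN prod_dim p ON f.prodno = p.prodno
    JOIN city_dim c ON f.cityname = c.cityname
    WHERE c.state = 'Maharashtra'
    GROUP BY t.month, p.category
    ORDER BY t.month, p.category
""").fetchall()
print(rows)
```

Every join goes from the fact table directly to a dimension table, which is what keeps the structure at two levels.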
Dimensions
• Stored as a database table
• Contains many descriptive attributes for analysis
• Small and slowly changing data
• Data often group-able for analysis
– Customers by age, occupation, income level
– Time by weeks, months, years
– Branches as rural, suburban or by size
• Thus, dimension data viewable as a hierarchy
• For analysis, data here joined with facts
Dimensions...
• Joins very frequent; efficient access to dimension
(through multiple indexes) and computation of
join required
• Heavily used in constraints and GROUP-BY
Facts
• Contain business activity data
• May be at a detailed level or a status level; called
transaction-oriented or snapshot-oriented, respectively
• Deciding on granularity: every sale, or total sales
of a day?
• Often contain numeric attributes for aggregation
(additive, semi-additive,…)
• Contain dimensional table keys also
Snowflake Schema
• Hierarchies not captured explicitly in a star
schema
• Snowflake schema represents hierarchy directly
• Saves on storage but requires more joins
Snowflake Schema
• Represent dimensional hierarchy directly by
normalizing tables.
[Figure: snowflake schema with a central fact table (date, custno, prodno, cityname, ...) and the tables time, cust, prod, city, and region]
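A small sketch of the trade-off, assuming a city dimension whose region attributes are normalized into a separate region table. The table layouts and the data are hypothetical; only the table names city, region, and fact come from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked city dimension: region attributes live in their own table.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE city (
    cityname TEXT PRIMARY KEY,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE fact (
    date TEXT, custno INTEGER, prodno INTEGER,
    cityname TEXT REFERENCES city(cityname),
    amount REAL
);
INSERT INTO region VALUES (1, 'West'), (2, 'South');
INSERT INTO city VALUES ('Mumbai', 1), ('Pune', 1), ('Chennai', 2);
INSERT INTO fact VALUES
    ('2024-01-15', 1, 10, 'Mumbai', 100.0),
    ('2024-01-16', 2, 10, 'Pune', 50.0),
    ('2024-01-17', 3, 20, 'Chennai', 70.0);
""")

# Rolling up to region now takes two joins (fact -> city -> region);
# in a star schema the region name would sit directly in the city table.
totals_by_region = conn.execute("""
    SELECT r.region_name, SUM(f.amount)
    FROM fact f
    JOIN city c ON f.cityname = c.cityname
    JOIN region r ON c.region_id = r.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(totals_by_region)
```

The region name is stored once per region rather than once per city, which is the storage saving; the extra join is the cost.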
Fact Constellation
• Fact Constellation
– Multiple fact tables that share many dimension tables
– Booking and Checkout may share many dimension
tables in the hotel industry
[Figure: fact constellation showing Booking and Checkout together with Hotels, Travel Agents, Promotion, Room Type, and Customer]
Data Mart
• A subset of warehouse for use by individuals or
departments
• Contents may be differently structured; may
contain limited history; may be coarser /
aggregated
• Lightens load on central warehouse
• Users primarily use marts with OLAP tools for
analysis and decision support
• Refreshed periodically from central warehouse
Aggregates
• An aggregate is a fact table representing a
summarization of base-level fact table data
• Aggregates are pre-calculated summaries stored
in the data warehouse to improve query
performance
• Aggregates can speed up queries by
a factor of 100 or even 1000
• The IS owners of a data warehouse should
exhaust the potential for using aggregates before
investing in new hardware
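As an illustration, the sketch below builds a monthly aggregate fact table from a base-level fact table and answers the same question from both. The schema and data are assumptions; a real warehouse would refresh the aggregate as part of the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (date TEXT, prodno INTEGER, amount REAL);
INSERT INTO sales_fact VALUES
    ('2024-01-05', 10, 100.0),
    ('2024-01-20', 10, 40.0),
    ('2024-01-20', 20, 60.0),
    ('2024-02-02', 10, 25.0);
-- Aggregate fact table: one row per (month, product) instead of one per sale.
CREATE TABLE sales_monthly_agg AS
    SELECT substr(date, 1, 7) AS month, prodno,
           SUM(amount) AS amount, COUNT(*) AS sale_count
    FROM sales_fact
    GROUP BY substr(date, 1, 7), prodno;
""")

# The same business question, answered from the base table and from the aggregate.
from_base = conn.execute(
    "SELECT SUM(amount) FROM sales_fact WHERE substr(date, 1, 7) = '2024-01'"
).fetchone()[0]
from_agg = conn.execute(
    "SELECT SUM(amount) FROM sales_monthly_agg WHERE month = '2024-01'"
).fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM sales_monthly_agg").fetchone()[0]
print(from_base, from_agg, agg_rows)
```

Both queries return the same total, but the aggregate query scans far fewer rows once the base table holds millions of sales.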
Warehouse Architecture
• Building a single organization-wide WH that
integrates all data from legacy systems is a very
challenging task
• Data marts are organized by subject or department and are easier to
build
• Multiple data marts must be relatable and interoperable across departments or business areas
• Kimball proposes a DW with a ‘bus architecture’:
an architecture phase followed by
construction of data marts independently and
asynchronously
WH Architecture ...
• As marts come on-line, they fit with each other
properly
• This approach is natural in most cases, as extraction
of data for WH building is often source-wise and
needs to be done independently
Conformed Dimensions and Facts
• Goal is to produce a master suite of conformed
dimensions and to standardize facts
• The resulting dimensions and facts form the ‘bus’
• A conformed dimension means the same thing with
every fact table (e.g., customer, time, geography)
• It may contain data brought together from many
sources
• Without conformed dimensions, a WH cannot
function as a whole
WH Architecture ...
• Getting conformed dimensions represents 80% of the
up-front architecture effort
• The rest goes to conformed facts, which ensure the same
terminology across data marts so that ‘drill
across’ can be done (e.g., price, profit)
• Conformed facts ensure the same units and meaning, and the same time
durations and geographies, across marts
WH Architecture ...
• Advantages of conformed dimensions
– a single dimension table can be used against multiple
fact tables in the same WH
– user interfaces and data content are consistent
whenever the dimension is used
– there is consistent interpretation of attributes and
rollups across marts
– a new data mart can be created such that it can coexist with the others
• Use of conformed dimensions must be supported
at the highest executive level
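A hedged sketch of a ‘drill across’ query: two fact tables, assumed to live in different marts, are each summarized by the same conformed product dimension and then combined. All table names, columns, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed product dimension shared by two fact tables.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact (product_key INTEGER, revenue REAL);
CREATE TABLE cost_fact (product_key INTEGER, cost REAL);
INSERT INTO product VALUES (1, 'loan'), (2, 'deposit'), (3, 'loan');
INSERT INTO sales_fact VALUES (1, 100.0), (2, 80.0), (3, 20.0);
INSERT INTO cost_fact VALUES (1, 60.0), (2, 50.0), (3, 30.0);
""")

def total_by_category(fact_table, measure):
    # fact_table and measure are fixed literals from this script, not user input.
    sql = (
        f"SELECT p.category, SUM(f.{measure}) FROM {fact_table} f "
        "JOIN product p ON f.product_key = p.product_key "
        "GROUP BY p.category"
    )
    return dict(conn.execute(sql).fetchall())

# Drill across: query each fact table separately, then merge the answer sets
# on the conformed attribute (category).
revenue = total_by_category("sales_fact", "revenue")
cost = total_by_category("cost_fact", "cost")
profit = {category: revenue[category] - cost[category] for category in revenue}
print(profit)
```

The merge is only meaningful because both marts agree on what a product category is.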
Financial Services : A Case Study
• A bank offers various products/services like
saving/checking accounts, mortgage loans,
personal loans, TD, credit cards, etc…
• Purpose : track various a/c, customer profiles,
etc…, for marketing and offering new services
• Requirements:
– Get end-of-month summary of a/c for last 5 years
– Valid snapshot as of yesterday for current month
(with full details)
– Ability to group a/c in various ways & compare
balances
– demographic behavior
Case Study ...
• Each account type has some unique attributes
(requiring customized dimension and facts for
each)
• Old data (a/c & customers ) may be incomplete
or even different
• The warehouse data may come from multiple
sources :
– Loan processing system (customer, loan, dues, payment)
– Fixed deposit system (customer, TD, …)
– Front-office system (customer, account, transaction, ...)
– Credit-card system (customer, transactions, interest, ...)
Case Study ...
• Must plan extraction, correlation, consistent
representation,…
• Let us consider a possible warehouse design for
the indicated requirements
• Core fact table : balance in each account, # of
transactions, grain : month
• Dimensions : a/c, household, branch, product,
status, time
• A/c and household separate : many accounts per
family; household definitions change
Case Study ...
• Product dimension permits hierarchy and
defining specific attributes; separate because it
changes
• Status : active or not, closed, etc. with reasons
• Account contains customer’s data; for historical
reasons, customer to accounts relationship not
well maintained
The household data warehouse
[Figure: star schema of the household data warehouse; the tables and their columns are listed below]
• Household Facts: account_key, household_key, branch_key, product_key, status_key, time_key, primary_balance, transaction_count
• Account dimension: account key, primary_name, secondary_name, account_address, account_city, account_state, account_zip, date_opened, primary_age, primary_sex, primary_marital
• Household dimension: household key, household_head_name, household_address, household_city, household_state, household_zip, household_income, household_type
• Branch dimension: branch key, branch_name, branch_address, branch_city, branch_state, branch_zip, branch_type
• Product dimension: product key, product_description, type, category
• Status dimension: status key, status_description, status_reason, new_account_flag, closed_account_flag
• Time dimension: time key, month, year, fiscal_quarter
Case Study ...
• Balance is semi-additive: it cannot be added
across time
• Products are highly heterogeneous: different
attributes characterize different accounts
(balance, deposit options, interest rate, overdraft
limit, ...)
• Cannot combine all in one dimension, as many attributes are not
applicable to all products
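To make the semi-additive point concrete, the sketch below sums balances across accounts within a month (meaningful) and contrasts that with summing across months (not meaningful). The snapshot rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical monthly snapshot rows: (account_key, month, balance).
snapshots = [
    ("A1", "2024-01", 100.0),
    ("A2", "2024-01", 300.0),
    ("A1", "2024-02", 150.0),
    ("A2", "2024-02", 250.0),
]

# Adding balances across accounts within one month is meaningful.
total_by_month = defaultdict(float)
for account_key, month, balance in snapshots:
    total_by_month[month] += balance

# Adding the same balances across months double counts the money;
# across time, use an average (or the latest value) instead.
naive_sum_over_time = sum(total_by_month.values())
average_over_time = naive_sum_over_time / len(total_by_month)

print(dict(total_by_month), naive_sum_over_time, average_over_time)
```

The bank holds 400 in each month, so 800 is not a balance anyone ever had; the time average of 400 is the usable figure.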
Case Study ...
• Solution: create many facts, customized for each
product, and one core fact with a product
dimension having common attributes; leads to
100% replication, but facilitates clarifications,
browsing, etc. and avoids join of customized and
core facts
• When many facts are to be stored together, go for
periodic (e.g., monthly) snapshots
Case Study ...
• Transaction-grained fact tables usually have a single
fact (e.g., amount) that is directly involved in the
transaction; we need a transaction dimension to
represent these amounts
• In a transaction-grained fact table, we do not need
customized fact tables per product; instead we
create customized dimension tables
Data Warehouse Life Cycle
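[Figure: data warehouse life cycle. Project planning leads to business requirement definition, which feeds three parallel tracks: technical architecture design followed by product selection & installation; dimensional modeling followed by physical design and data staging design & development; end-user application specification followed by end-user application development. The tracks converge on deployment, followed by maintenance & growth. Project management spans the whole life cycle.]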
Life Cycle Phases
• Project planning
– The life cycle begins with project planning, which addresses
the scoping of the project
– Focuses on resource and skill-level staffing
requirements, project task assignments, and duration
• Business requirements definition
– Success of the project depends on a sound
understanding of the business users and their
requirements
– Data warehouse designers must understand the key
factors driving the business requirements and translate
them into design considerations
Phases ...
• Dimensional modeling
– Dimensional modeling is performed by combining data
analysis with our earlier understanding of business
requirements (represented as a matrix)
– This step identifies the fact table grain, associated
dimensions, attributes and hierarchical drill paths,
and facts
• Physical design
– The primary elements in this phase are defining the
naming standards and setting up the database
environment
– It focuses on defining the physical structures
necessary to support the logical database design
Phases ...
• Data staging design and development
– The data staging process has three major steps
– Extraction
• It exposes data quality issues within the operational
system
– Transformation
• Consists of data restructuring and type conversions
(e.g., from the EBCDIC character set to ASCII)
– Load
• Load the prepared data into the target tables
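The three steps can be sketched as below. The fixed-width record layout, the choice of EBCDIC code page 037 (Python's cp037 codec), and the target table are all assumptions for illustration, not part of the original material.

```python
import sqlite3

# Hypothetical source extract: fixed-width EBCDIC records (code page 037),
# a 4-character account number followed by a 6-digit balance in hundredths.
raw_records = [
    "A001012550".encode("cp037"),
    "A002000900".encode("cp037"),
    "A00X00????".encode("cp037"),  # deliberately bad record
]

def extract(records):
    # Character set conversion: EBCDIC bytes to Python (Unicode) strings.
    return [record.decode("cp037") for record in records]

def transform(lines):
    # Restructure each line and convert types; set aside rows that fail,
    # since extraction tends to expose data quality issues.
    good, rejected = [], []
    for line in lines:
        account, amount = line[:4], line[4:]
        if amount.isdigit():
            good.append((account, int(amount) / 100))
        else:
            rejected.append(line)
    return good, rejected

def load(rows, conn):
    conn.execute("CREATE TABLE balance_fact (account TEXT, balance REAL)")
    conn.executemany("INSERT INTO balance_fact VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
rows, rejected = transform(extract(raw_records))
load(rows, conn)
loaded = conn.execute(
    "SELECT account, balance FROM balance_fact ORDER BY account"
).fetchall()
print(loaded, rejected)
```

Keeping the rejected rows, rather than dropping them silently, gives the team something to report back to the owners of the source system.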
Phases ...
• Technical Architecture Design
– It specifies the tools and techniques we will need to
make DW happen
• Product Selection and Installation
– Covers architectural components such as hardware
platforms, DBMS, and data staging tools
• End user application specification
– The application specification describes the report templates,
user-driven parameters, and required calculations
• End user application Development
Phases ...
• Deployment
– It is the convergence of technology, data, and end user
applications accessible from the business user’s
desktop
– Business user education integrating all aspects of the
convergence must be developed and delivered
• Maintenance and growth
– Data warehouse acceptance and performance metrics
should be measured over time and the maintenance
plan should include a communication strategy
– Prioritization processes must be established to deal
with user demands for evolution and growth
Phases ...
• Project management
– Project management ensures that the business
dimensional life cycle activities remain on track and
synchronized
– These activities occur throughout the life cycle
– It focuses on monitoring the project status, issue
tracking, and change control to preserve scope
– It includes the development of a comprehensive
project communication plan that addresses both the
business and information system organization
• Use a good project management tool
Life Cycle : summary
• Project planning
• Business requirements definition
• Data track
– Dimensional modeling
– Physical design
– Data staging design and development
• Technology track
– Technical architectural design
– Product selection and installation
Life Cycle...
• Application track
– End user application specification
– End user application development
• Deployment
• Maintenance and growth
• Project management
Assess Your Readiness
• Strong business management sponsors
• Compelling business motivation
• IS/Business partnership
• Current analytic culture
• Feasibility
Core Project Team
• Business system analyst
• Data modeler
• Data warehouse database administrator
• Data staging system designer
• End user application developers
• Data warehouse educator
Special Teams
• Technical/security architect
• Technical support specialists
• Data staging programmer
• Data administrator
• Data warehouse quality assurance analyst
Develop the Project Plan
• Integrated and detailed
• Resources
• Original estimated effort
• Start date
• Original estimated completion date
• Current estimated completion date
• Status
• Effort to complete
• Dependencies
• Late flags
Develop Communication Plan
• To manage expectations at all levels
• Within project team: share scope, plans, status
• Face-to-face communications with sponsors
• Business user community: inform what is there
for them: capabilities, limitations, timeframes
• Communication with other interested parties
– Executive management
– IS organization - to enable integration with existing
and proposed systems
– Organization at large
Collecting Requirements
[Figure: Business Requirements shown together with the activities they drive: Project Planning & Management, Dimensional Modeling, Physical Design, Data Staging Design, Technical Architecture Design, End-User Application Specification, Deployment Planning, and Maintenance and Growth]
Collecting Requirements...
• Interviews/write-ups
• Requirements findings document
– Project overview
– Review of business objectives
– Analytic and information requirements
– Preliminary source systems analysis
– Preliminary success criteria
• Prepare and publish the requirements
• Agree on next step after collecting requirements
• Facilitation for conforming and prioritization
Collecting Data about Existing Systems
• Understanding the candidate data sources
• Source data ownership
• Data providers
• Detailed criteria for selecting the data sources
– Data accessibility
– Longevity of the feed
– Data accuracy
– Project scheduling
• Customer matching and house-holding
• Browsing and data content
• Mapping data from source to target
Designing the Data Warehouse /
Data Marts
• Identifying marts and dimensions
• Identify marts based on facts likely to be used
together, as a mart is a kind of subject area or
application (divide-and-conquer strategy)
• Marts are often based on a single business process or a
single source
• 10 to 30 marts are common for a large organization
• Build a matrix of marts versus dimensions
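A matrix of marts versus dimensions can be kept as a simple mapping. The sketch below uses hypothetical marts and dimensions for a bank, prints the matrix, and lists the dimensions used by more than one mart, which are the natural candidates for conforming.

```python
# Hypothetical marts and the dimensions each one uses (illustrative only).
marts = {
    "loans": {"time", "customer", "branch", "product"},
    "deposits": {"time", "customer", "branch", "product"},
    "credit_cards": {"time", "customer", "merchant"},
}

dimensions = sorted(set().union(*marts.values()))

def matrix_rows(marts, dimensions):
    """Return one text row per mart, with an X where the mart uses a dimension."""
    rows = []
    for mart in sorted(marts):
        cells = ["X" if dim in marts[mart] else "." for dim in dimensions]
        rows.append(f"{mart:<14}" + " ".join(f"{cell:^9}" for cell in cells))
    return rows

print(f"{'mart':<14}" + " ".join(f"{dim:^9}" for dim in dimensions))
print("\n".join(matrix_rows(marts, dimensions)))

# Dimensions used by more than one mart are the candidates for conforming.
shared = [d for d in dimensions if sum(d in dims for dims in marts.values()) > 1]
print(shared)
```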
Designing a Fact
• Choose a data mart : start with single source
data marts
• Define fact grain based on the basic business
facts stored in legacy systems
• Choose dimensions and match them with
granularity of facts
• Combine as many facts as possible within the
context of the defined granularity
Detailed Design Tips
• Labels which name data marts, dimensions and
attributes should be chosen carefully to refer to
corresponding business entities
• An attribute (in a dimension) is not replicated,
but a fact may be present in many fact tables
• If a dimension occurs multiple times (eg, time), it
is playing multiple roles; name them uniquely
• A single field in the underlying source data can
have one or more logical columns associated with
it (eg, product having code, description, etc)
• Every fact should have a default aggregation rule
so that it is not aggregated wrongly
Data Modeling Tool
• The advantages of a data modeling tool are
– Integrates the data warehouse model with other
corporate data models
– Helps assure consistency in naming
– Creates good documentation
– Generates physical schema
– Provides a reasonably intuitive user interface for
entering comments about objects
Dimensional Modeling
• Strengths of dimensional modeling
– It is a predictable and standard framework
– It makes the user interfaces more understandable and
processing more efficient
– The predictable framework of a dimensional model
allows both database systems and end user query tools
to make strong assumptions about the data that aid in
presentation and performance
– It is gracefully extensible to accommodate unexpected
new data elements and new design decisions
– A number of standard approaches exist for handling
common modeling situations in the business world
Dimension Attributes
• The quality of the data warehouse is measured
by the quality of the dimension attributes
• The user interface responses and final reports
are restricted to the precise contents of the
dimension table attributes
• Properties
– Verbose, descriptive, complete
– Quality assured, indexed
– Equally available, documented
Time Dimension
• Every data warehouse fact table is a time series
of some observations
• We always seem to have one or more time
dimensions in our fact table designs
• Provides useful hierarchies : week, month,
quarter, year, etc
• Represents calendar with many useful attributes
like day of week, day of month, week#, day#,
quarter, weekday-flag, last-day-of-month-flag,
holiday flag, etc.
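A minimal sketch of generating such a calendar table with Python's standard library. The attribute names mirror the list above; the function name, the key scheme, and the holiday set are assumptions.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def build_time_dimension(start, end, holidays=frozenset()):
    """Return one row (a dict) per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        last_day = calendar.monthrange(day.year, day.month)[1]
        rows.append({
            "time_key": (day - start).days + 1,   # simple surrogate key
            "date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_of_month": day.day,
            "day_number_in_year": day.timetuple().tm_yday,
            "week_number": day.isocalendar()[1],  # ISO week number
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "weekday_flag": day.weekday() < 5,
            "last_day_of_month_flag": day.day == last_day,
            "holiday_flag": day in holidays,
        })
        day += timedelta(days=1)
    return rows

# Hypothetical holiday list with a single entry.
time_dim = build_time_dimension(
    date(2024, 1, 1), date(2024, 12, 31), holidays={date(2024, 1, 26)}
)
print(len(time_dim), time_dim[0])
```

Because the table is small and built once, it costs little to carry many such attributes, and each one becomes a ready-made constraint or grouping column.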
Slowly Changing Dimensions
• The production key or customer key does not
change, but the description of the product or
customer does
• The data warehouse has three options for handling such
changes
– Overwrite the dimension record with the new values,
thereby losing history
• Used whenever the old value of the attribute has
no significance
• Correction of an error falls into this category
Slowly Changing Dimensions...
– Create a new additional dimension record using a new
value of the surrogate key
• This is the primary technique for accurately tracking a change
in an attribute within a dimension
• Requires use of a surrogate key
• Used when a true
physical change to the dimension entity has taken
place
– Create an “old” field in the dimension record to store
the immediately previous attribute value
• Used when a change is tentative
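The three options for a slowly changing dimension (often called Type 1, Type 2, and Type 3 in Kimball's terminology) can be sketched on an in-memory customer dimension. The row layout, function names, and the tracked attribute (city) are assumptions for illustration.

```python
from itertools import count

next_key = count(1)  # surrogate key generator

def new_row(production_key, city):
    return {"surrogate_key": next(next_key), "production_key": production_key,
            "city": city, "old_city": None}

def overwrite(dim, production_key, city):
    # Option 1: overwrite in place; the previous value is lost.
    for row in dim:
        if row["production_key"] == production_key:
            row["city"] = city

def add_record(dim, production_key, city):
    # Option 2: add a record with a new surrogate key; old facts keep
    # pointing at the old record, so history is preserved.
    dim.append(new_row(production_key, city))

def keep_old_field(dim, production_key, city):
    # Option 3: keep only the immediately previous value in an "old" field.
    for row in dim:
        if row["production_key"] == production_key:
            row["old_city"], row["city"] = row["city"], city

dim1 = [new_row("C100", "Pune")]
overwrite(dim1, "C100", "Mumbai")

dim2 = [new_row("C100", "Pune")]
add_record(dim2, "C100", "Mumbai")

dim3 = [new_row("C100", "Pune")]
keep_old_field(dim3, "C100", "Mumbai")

print(dim1, dim2, dim3, sep="\n")
```

Only the second option needs the surrogate key, because the production key C100 now appears on two dimension records.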
Time Stamping the Changes
• The design of slowly changing dimension may be
established by adding begin and end time stamps
and a transaction description in each instance of
a dimension record
• This design allows very precise time slicing of the
dimension by itself
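A short sketch of time slicing, assuming each dimension record carries begin and end dates plus a change description. The field names, the far-future end date used for the current record, and the data are hypothetical.

```python
from datetime import date

# Hypothetical customer dimension records with begin/end time stamps.
customer_dim = [
    {"surrogate_key": 1, "production_key": "C100", "city": "Pune",
     "begin": date(2020, 1, 1), "end": date(2023, 6, 30),
     "change": "initial load"},
    {"surrogate_key": 2, "production_key": "C100", "city": "Mumbai",
     "begin": date(2023, 7, 1), "end": date(9999, 12, 31),
     "change": "customer moved"},
]

def as_of(dim, production_key, when):
    """Return the dimension record that was valid on the given date, if any."""
    for row in dim:
        if row["production_key"] == production_key and row["begin"] <= when <= row["end"]:
            return row
    return None

print(as_of(customer_dim, "C100", date(2022, 1, 1))["city"])
print(as_of(customer_dim, "C100", date(2024, 1, 1))["city"])
```

The lookup needs no fact table at all, which is what slicing the dimension by itself means here.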
Large Dimensions
• Data warehouses that store extremely granular
data may require some extremely large
dimensions
• To support large dimensions we must choose the
indexing technologies and data design
approaches that:
– Support rapid browsing of the unconstrained
dimension, especially for low cardinality attributes
– Support efficient browsing of cross-constrained
values in the dimension table
– Find and suppress duplicate entries in the dimension
Foreign Key, Primary Key,
Surrogate Key
• All dimensional tables have single keys, which,
by definition, are primary keys
• All data warehouse keys must be meaningless
surrogate keys; you must not use the original
production keys
• A four byte integer makes a good surrogate key
• Surrogate date keys
• Avoid smart keys
• Avoid production keys
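One way to keep production keys out of the warehouse is a lookup, maintained in the staging area, from (source system, production key) to a meaningless integer. The class below is a hypothetical sketch of that idea, not a prescribed design.

```python
MAX_KEY = 2**31 - 1  # largest value of a signed four-byte integer

class SurrogateKeyMap:
    """Assigns sequential, meaningless integer keys to production keys."""

    def __init__(self):
        self._keys = {}

    def key_for(self, source_system, production_key):
        natural = (source_system, production_key)
        if natural not in self._keys:
            new_key = len(self._keys) + 1
            if new_key > MAX_KEY:
                raise OverflowError("surrogate key space exhausted")
            self._keys[natural] = new_key
        return self._keys[natural]

keys = SurrogateKeyMap()
k1 = keys.key_for("loans", "CUST-0042")
k2 = keys.key_for("cards", "CUST-0042")   # same code, different source system
k3 = keys.key_for("loans", "CUST-0042")   # repeat lookup returns the same key
print(k1, k2, k3)
```

The surrogate carries no meaning, so a production key that is reused or reformatted in a source system does not disturb the warehouse.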
Heterogeneous Product Schemas
• Multiple fact tables are needed when a business
has heterogeneous products
• The global view needs a single core fact table
crossing all lines of business, whereas local view
focuses on specific product
• There are many attributes and facts which apply
only to a specific product; a single fact table is
not feasible
• Create a customized fact table and (product) dimension
table for each product, and build a core fact
table with attributes that make sense across all
lines of business; this allows a single
portfolio (of products) to be created for each customer
Transaction Schema
• Every data mart needs two separate models
– Transaction version
– Periodic snapshot version
• ‘rolling’ snapshot containing averages across time
• Snapshots allow us to quickly measure the status of
the enterprise
• The Transaction schema
– Low-level transactions in the organization make for a
good dimensional framework
– The fact record for an individual transaction
frequently contains only a single value
Transaction Schema..
• The transaction-based WH commonly used in
– Time of day analysis
– Queue analysis
– Fraud detection
– Basket analysis
– Current status
Factless Fact Tables
• Useful to describe events and their coverage
• An event fact table records the occurrence of an
event; it has only a flag and dimension keys (e.g.,
student attendance)
• A coverage fact table is frequently needed when a
primary fact table in a dimensional data
warehouse is sparse; e.g., the primary fact table will
not show items which were on promotion but
did not sell; the coverage table, containing only
dimension keys, lists all items on promotion
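The promotion example can be sketched with plain sets: the coverage table holds only dimension keys, and subtracting the keys present in the sales fact table yields what was promoted but did not sell. All keys and values below are made up.

```python
# Coverage (factless) table: only dimension keys (product, store, week).
promotion_coverage = {
    ("soap", "S1", "2024-W01"),
    ("tea", "S1", "2024-W01"),
    ("tea", "S2", "2024-W01"),
}

# Primary fact table: a row exists only when something actually sold.
sales_fact = [
    {"product": "soap", "store": "S1", "week": "2024-W01", "units": 12},
    {"product": "tea", "store": "S2", "week": "2024-W01", "units": 3},
    {"product": "rice", "store": "S1", "week": "2024-W01", "units": 7},
]

sold = {(row["product"], row["store"], row["week"]) for row in sales_fact}
promoted_but_unsold = sorted(promotion_coverage - sold)
print(promoted_but_unsold)
```

The sales fact table alone could never answer this question, because the missing sale leaves no row behind.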
Facts of Different Granularity
• The dimensional model gains power as the
individual fact records become more and more
atomic
• At the lowest level of individual transactions, the
design is most powerful because
– More of the descriptive attributes have single values
– The design withstands surprise in the form of new
facts, new dimensions, or new attributes within
existing dimensions
– More expressiveness at the lowest levels of granularity
Technical Architecture
[Figure: technical architecture, split into the back room and the front room. Elements shown: source system, operational system, data staging services, data staging area, presentation servers (dimensional data marts with only aggregated data; dimensional data marts including atomic data), metadata catalog, query services, standard reporting tools, desktop data access tools, application models. Key: data element, service element]
The Technical Architecture...
It describes the flow of data from the source systems
to the decision makers
• Data staging services
– Extract
– Transformation
– Load
– Job control
• Query services
– Warehouse browsing
– Access and security
– Query management
– Standard reporting
– Activity monitor
Metadata Catalog
• It is an integral part of the overall architecture
• It contains information that describes the
warehouse and plays an active role in its
creation, use, and maintenance
• Contains source system metadata (data and
processes), data staging metadata (dimensions,
transformations, aggregations), DBMS metadata
(tables, indexes, stored procedures), and front-room metadata (users, applications)
Technical Architecture Features
• Metadata driven
– Metadata provides flexibility by buffering the various
components of the system from each other
– The metadata catalog provides parameters and
information that allow the applications to perform
their tasks
• Flexible services layers
– The data staging services and data query services add
to the flexibility of the architecture
Back Room : Data Staging Area
• It is the construction site for the Warehouse
• The central role of the staging area is to evolve into
the source system of record for all downstream
DSS and reporting environments
• Data staging data models
– The data models can be designed for performance and
ease of development
– Third normal form models often appear in the data staging
area because the source systems are duplicated
Data Staging Area...
• Atomic data marts hold the lowest level of
detail necessary to meet most of the high-value
business requirements
– Atomic data mart storage should be relational
rather than OLAP because of the extreme level of detail,
the number of dimensions, and the size
– The atomic data mart data model is built around the
dimensional model, not an ER model
Transformation Services
• It is a process of transforming the data from
source systems into something presentable to the
end users and valuable to the business
• Different transformation services :
– Integration
– Slowly changing dimension maintenance
– Referential integrity checking
– Data type conversion
– Aggregation
– Data content audit
– Pre- and post-step exits
Front Room Architecture
• It is the public face of the warehouse: what the
business users see and work with day-to-day
• The presentation servers are machines on which
the data warehouse data is organized for direct
querying by the end users and report writers
• The major types of activities here :
– Warehouse or metadata browsing
– Access and security
– Activity monitoring
– Query management
– Standard reporting
Warehouse Browsing
• Using the browsing tools to find and access the
information needed by the user
• The warehouse browser should be dynamically
linked to the metadata catalog
• It should be able to pull the definition and
derivations of the various data elements and to
show a set of standard reports
• Browsing tools
– Visual Basic
– Microsoft Access, etc
Access and Security Services
• Access and security services facilitate a user’s
connection to the database
• They rely on authentication and authorization
services, where the user is identified and access
rights are determined or access is refused
• The level of authentication depends on how
sensitive the data is
Activity Monitoring Services
• Capturing the information about the use of the
data warehouse
• The capabilities are :
– Performance
– User support
– Marketing
– Planning
Query Management Services
• Query management services are the set of
capabilities that manage the execution of the
query, and return of the result set to the desktop
• The major query management services are :
– Query reformulation
– Query re-targeting and multi-pass SQL
– Aggregate awareness
– Query governing
Standard Reporting Services
• They provide the ability to create fixed-format reports
that require limited user interaction and run on regular
execution schedules
• Requirements for standard reporting tools are :
– Report development environment
– Report execution server
– Time- and event-based scheduling of report execution
– Iterative execution
– Flexible report definition
– Flexible report delivery
– Report library with browsing capability
Back Room infrastructure factors
• Infrastructure for the data warehouse includes
the hardware, the network, and lower-level
functions such as security
• The database server is the biggest hardware
platform decision for most data warehouse
projects
Back Room Infrastructure Factors...
• The major factors in determining requirements
for the server platforms are :
– Data size
• Most data warehouse/data mart projects tend to start
out with no more than 200 GB
• Data warehouses of less than 100 GB are considered small,
those from 100 GB to 500 GB typical, and those with more
than 500 GB large
– Volatility
• It measures the dynamic nature of the database; it
includes how often the database will be updated and how
much data is replaced each time
Back Room Infrastructure Factors...
– Number of users
• How active the users are, how many are active
concurrently, and their geographical distribution etc.
are important factors in selecting a platform
– Number of business processes
• It increases the complexity of the data warehouse
• Separate hardware platforms for each business process
– Nature of use
• The front-end tools and the types of queries have implications for
platform selection
Technical Factors
• Platforms
– NT servers for medium-sized warehouses
• NT is a cost-effective platform for smaller
warehouses or data marts
– Open system servers
• The open system, or Unix, servers are the primary
platform for most medium-sized or larger warehouses
• If the data warehouse is based on a Unix
environment, the warehouse team will need to know
administrative tools, basic Unix commands and
utilities to be able to develop and manage the
warehouse
Technical Factors...
• Disks
– Disk drives can have a major impact on the
performance, flexibility, and scalability of the
warehouse platform
• Memory
– More memory is better for data warehousing
– Transaction requests are small and typically don’t
need much memory; decision support queries require
more memory and involve large tables
– If the table can fit in memory, performance can
improve 10 to 100 times
Technical Factors...
• Database platform
– Data warehouses are implemented using mainframe-based database products
– Some data warehouses are implemented using
specialized multidimensional database products called
MOLAP (multidimensional on-line analytical
processing) engines
– MOLAP engines came about in response to three
main user requirements: simple data access, crosstab-style reports, fast response time
– The significant benefit of using a MOLAP engine is
the end user query performance
Physical Design
• In the physical design, the data warehouse team
is required to estimate the warehouse’s size
• In data warehouses, the size of dimension tables
is insignificant compared to the size of the fact
tables and the size of the indexes on the fact
tables
Initial Sizing Estimates...
• Preliminary sizing estimates include
– Estimate row length
– Estimate number of rows
– Count and sizes of indexes
– Temp space
– Space for metadata tables
– Considerable space for aggregate tables
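A back-of-the-envelope sketch of such an estimate. The overhead factors for indexes, aggregates, and temp space are placeholder assumptions, not recommended values, and the row counts are invented; only the 5 years of monthly history echoes the case study.

```python
def estimate_size_gb(rows, row_bytes, index_factor=1.0,
                     aggregate_factor=1.0, temp_factor=0.5):
    """Rough fact table sizing: base data plus overheads expressed as
    multiples of the base size. All factor defaults are assumptions."""
    base_bytes = rows * row_bytes
    total_bytes = base_bytes * (1 + index_factor + aggregate_factor + temp_factor)
    return total_bytes / 1e9

# Hypothetical monthly snapshot fact table: 1,000,000 accounts kept for
# 5 years of month-end history, with 6 four-byte keys and 2 eight-byte facts.
rows = 1_000_000 * 60
row_bytes = 6 * 4 + 2 * 8
size_gb = estimate_size_gb(rows, row_bytes)
print(rows, row_bytes, round(size_gb, 1))
```

Here the base data is 2.4 GB and the overheads bring the estimate to about 8.4 GB; dimension tables are left out because they are small next to the fact table and its indexes.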
Indexes and Query Strategies
• To develop an index plan, it’s important to
understand how the RDBMS’s query optimizer
and indexes work
– The B-tree index
– The bitmapped index
– The hash index
– Other index types
– Star schema optimization
• Indexing the fact tables, indexing the dimension tables, and
indexing for loads
End User Application
[Figure: end user applications. Axes: nature of use (strategic to operational) and customer type (ad hoc power users, push-button knowledge workers, operational consumers). Items shown: desktop tools for do-it-yourself queries; end user application reporting/analysis examples (assured reference points, low effort, current business view, flexible); standard reports; operational reporting environment; migration paths between them. Also labelled: information interface, value.]
End User Application Template
• It provides the layout and structure of a report
that is driven by a set of parameters
• This approach allows users to generate a number
of similarly structured reports from a single
template
• Through the drill-down capabilities, a user could
produce reports on other attributes; this action
results in changing the actual template structure
• Many data access tools provide this functionality
transparently
Typical Analysis Cycle
• How’s business?
• What are the trends?
• What’s unusual?
• What is driving those exceptions?
• What if…?
• Make a business decision
• Implement the decision
The Desktop Installation Readiness
• The back room architecture and infrastructure
will be established long before deployment as it is
needed for development activities
• The technology residing on user’s desktop is the
last piece that must be put in place prior to the
deployment
The Desktop Installation Readiness...
• Check list of activities that should occur well
before the deployment
– Determine the client configuration requirement
– Determine LAN addresses
– Conduct a physical audit
– Complete the contract and procurement process
– Acquire user logons and security approval
– Test installation procedures on a variety of machines
– Schedule the installation
– Install the desktop hardware and/or software
– Complete installation testing
End User Education Strategy
• A robust education strategy for business end users
is a prerequisite for data warehouse success
• Integrate and tailor education content
• Education for business users must address three
key aspects of the data warehouse
– Data content
– End user application
– The data access tool
The End User Education Strategy…
• Data education content
– provide an overview of structures, hierarchies,
business rules, and definitions
– Before deployment, identify, document, and
communicate these data to the business users
– Factors causing discrepancy between data from the
warehouse and previously reported information are :
• The data warehouse information is incorrect
• The warehouse information has a different or new
business definition or meaning
• The previously reported information was incorrect
An End User Support Strategy
• The user support strategies vary by organization
and culture, based largely on the expectations of
senior business management
• Determine the support organization structure
– Centralized team of support resources handles the
more global data warehouse maintenance and
responsibility
– The team typically serves as a second line of defense,
and provides a pool of advanced application
development resources
An End User Support Strategy...
• Establish support communication and feedback
– Communication with your users should, at a minimum,
consist of general information and status updates
– Success stories can help motivate
• Provide support documentation
• Create a Warehouse web site
Conclusion
• Building a corporate-wide data warehouse is a
challenging task
• A systematic methodology is essential
• Plan the architecture globally but build it
incrementally
• Keep user requirements at the core of all
development activities