Uploaded by Halu Halu

Chp5 DW Planning

REQUIREMENTS AS THE DRIVING
FORCE FOR DATA WAREHOUSING
CHAPTER OBJECTIVES
 Understand why business requirements are the driving
force
 Discuss how requirements drive every development
phase
 Specifically learn how requirements influence data
design
 Review the impact of requirements on architecture
 Note the special considerations for ETL and metadata
 Examine how requirements shape information delivery
DEFINING THE BUSINESS REQUIREMENTS
 In several ways, building a data warehouse is very different
from building an operational system.
 This becomes notable especially in the requirements
gathering phase.
 Because of this difference, the traditional methods of
collecting requirements that work well for operational
systems cannot be applied to data warehouses.
Dimensional Nature of Business Data
 In a data warehousing system, the users are generally unable to
define their requirements clearly.
 Users cannot define precisely what information they really
want from the data warehouse, nor can they express how
they would like to use the information or process it.
 Managers think of the business in terms of business
dimensions.
 If your users of the data warehouse think in terms of business
dimensions for decision making, you should also think of
business dimensions while collecting requirements.
Figure 5-2 shows the analysis of sales units along the three business dimensions of product,
time, and geography.
Examples of business dimensions
Information Package Diagrams
 Information packages: a novel idea for determining and
recording information requirements for a data warehouse.
 This is a new methodology for determining requirements for a data
warehouse system based on business dimensions.
 Using the new methodology, you come up with the
measurements and the relevant dimensions that must be captured
and kept in the data warehouse. The result is what is known
as an information package for the specific subject.
Example: information package for analyzing sales for
a certain business
 The subject here is sales. The measured facts or the measurements that
are of interest for analysis are shown in the bottom section of the
package diagram. In this case, the measurements are actual sales,
forecast sales, and budget sales. The business dimensions along which
these measurements are to be analyzed are shown at the top of diagram
as column headings. In our example, these dimensions are time,
location, product, and demographic age group. Each of these business
dimensions contains a hierarchy or levels. For example, the time
dimension has the hierarchy going from year down to the level of
individual day. The other intermediary levels in the time dimension
could be quarter, month, and week. These levels or hierarchical
components are shown in the information package diagram.
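The sales information package described above can be written down as a small data structure. This is only a minimal sketch: the class names `Dimension` and `InformationPackage` are hypothetical, and the hierarchy levels for every dimension except time are assumptions made for illustration, since the text only spells out the time hierarchy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dimension:
    name: str
    # Levels ordered from the most summarized to the most detailed.
    hierarchy: List[str]


@dataclass
class InformationPackage:
    subject: str
    dimensions: List[Dimension]
    measures: List[str]

    def lowest_levels(self) -> Dict[str, str]:
        """Return the most detailed level of each dimension."""
        return {d.name: d.hierarchy[-1] for d in self.dimensions}


sales_package = InformationPackage(
    subject="sales",
    dimensions=[
        # Time hierarchy as described in the text: year down to day.
        Dimension("time", ["year", "quarter", "month", "week", "day"]),
        # The levels below are assumptions, not taken from the text.
        Dimension("location", ["country", "region", "city", "store"]),
        Dimension("product", ["category", "brand", "product"]),
        Dimension("demographic_age_group", ["age_group"]),
    ],
    measures=["actual_sales", "forecast_sales", "budget_sales"],
)

print(sales_package.lowest_levels())
```

The lowest level of each hierarchy is a reasonable first guess at the granularity at which the measurements would be kept.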
An information package diagram (IPD) enables you to:
 Define the common subject areas
 Design key business metrics
 Decide how data must be presented
 Determine how users will aggregate or roll up
 Decide the data quantity for user analysis or query
 Decide how data will be accessed
 Establish data granularity
 Estimate data warehouse size
 Determine the frequency for data refreshing
 Ascertain how information must be packaged
Dimension Hierarchies/Categories
 When a user analyzes the measurements along a business dimension, the
user usually would like to see the numbers first in summary and then at
various levels of detail.
 What the user does here is to traverse the hierarchical levels of a business
dimension for getting the details at various levels.
 The hierarchy of the time dimension consists of the levels of year, quarter,
and month.
 The dimension hierarchies are the paths for drilling down or rolling up in
our analysis.
 Within each major business dimension there are categories of data
elements that can also be useful for analysis.
 Hierarchies and categories are included in the information packages for
each dimension.
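Rolling up and drilling down along the time hierarchy can be sketched with plain dictionaries. The sales figures and the function names `roll_up` and `drill_down` are hypothetical; the example assumes each fact is keyed by its (year, quarter, month) position in the hierarchy.

```python
from collections import defaultdict

# Hypothetical monthly sales units keyed by (year, quarter, month).
monthly_units = {
    (2023, "Q1", "Jan"): 120,
    (2023, "Q1", "Feb"): 95,
    (2023, "Q1", "Mar"): 140,
    (2023, "Q2", "Apr"): 110,
}


def roll_up(facts, depth):
    """Aggregate facts to the first `depth` levels of the hierarchy."""
    totals = defaultdict(int)
    for key, units in facts.items():
        totals[key[:depth]] += units
    return dict(totals)


def drill_down(facts, prefix):
    """Return the detailed facts that sit below one summary level."""
    return {k: v for k, v in facts.items() if k[: len(prefix)] == prefix}


by_year = roll_up(monthly_units, 1)      # summary first
by_quarter = roll_up(monthly_units, 2)   # one level of detail lower
q1_months = drill_down(monthly_units, (2023, "Q1"))

print(by_year)
print(by_quarter)
print(q1_months)
```

The same two functions work for any dimension whose keys are ordered from the most summarized level to the most detailed one.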
Business dimensions for
automobile manufacturer
Structure for Key Measurements
 Key measurements are the measures that are used for
business analysis and monitoring. Users measure
performance by using and comparing key measurements.
 For example, if an information package diagram has product,
customer, time, and location as the business dimensions,
these four dimensions will be four distinct components in the
structure of the data model.
Fundamental principle
Figure 6-1 Business requirements as the driving force.
DATA DESIGN
 In the data design phase you come up with the data model for
the following data repositories:
 The staging area where you transform, cleanse, and integrate
the data from the source systems in preparation for loading into
the data warehouse repository.
 The data warehouse repository itself.
Figure 6-2 Requirements driving the data model.
In Figure 6-2, the base of the pyramid represents the data model for the enterprise-wide data repository
and the top half of the pyramid denotes the dimensional data model for the data marts. What do you
need in the requirements definition to build the two halves of the pyramid? Two basic pieces of
information are needed: the source system data models and the information package
diagrams.
 If you are adopting the practical approach of building your data
warehouse as a conglomeration of conformed data marts, your
data model at this point will consist of the dimensional data
model for your first set of data marts.
 On the other hand, your company may decide to build the large
corporate-wide data warehouse first along with the initial data
mart fed by the large data warehouse.
 In this case, your data model will include both the data model
for the large data warehouse and the data model for the initial
data mart.
THE ARCHITECTURAL PLAN
 Planning the architecture raises the following points:
1) It involves reviewing each of the major architectural components.
2) It also involves the interfaces among the various components.
3) How can the management and control module be designed to
coordinate and control the functions of the different components?
4) What information do you need to do the planning?
5) How will you know how to size each component and provide the
appropriate infrastructure to support it?
Of course, the answer is business requirements. All the
information you need to plan the architecture must come from
the requirements definition.
Composition of the Components
Let us review each component and ascertain what exactly is needed in the requirements definition to
plan for the data warehouse architecture.
In the following list, the points under each component indicate the type of information that must be
contained in the requirements definition to drive the architectural plan.
Figure 6-4 provides a useful summary of the architectural components driven by requirements. The
figure indicates the impact of business requirements on the data warehouse architecture.
Figure 6-4 Impact of requirements on architecture
Special Considerations
 Having reviewed the impact of requirements on the
architectural components in some detail, we now
turn our attention to a few functions that deserve
special consideration. We need to bring out these
special considerations because if these are missed in
the requirements definition, serious consequences
will occur. When you are in the requirements
definition phase, you have to pay special attention to
these factors.
Data Extraction/Transformation/Loading (ETL).
 Data Extraction:
 Clearly identify all the internal data sources. Specify all the
computing platforms and source files from which the data is to
be extracted.
 If you are going to include external data sources, determine the
compatibility of your data structures with those of the outside
sources.
 Also indicate the methods for data extraction.
 Data Transformation.
 Many types of transformation functions are needed before data
can be mapped and prepared for loading into the data warehouse
repository.
 These functions include input selection, separation of input structures,
normalization and denormalization of source structures, and
conversions of names and addresses.
 This turns out to be a long and complex list of functions.
 Examine each data element planned to be stored in the data
warehouse.
 Data Loading.
 Define the initial load.
 Determine how often each major group of data must be kept
up-to-date in the data warehouse.
 How much of the updates will be nightly updates?
 Does your environment warrant more than one update cycle in
a day?
 How are the changes going to be captured in the source
systems?
 Define how the daily, weekly, and monthly updates will be
initiated and carried out.
Data Quality
 Bad data leads to bad decisions.
 If the data quality of your data warehouse is suspect, the users
will quickly lose confidence and flee the data warehouse.
 Data quality in a data warehouse is sacrosanct.
 Therefore, right in the early phase of requirements
definition, identify potential sources of data pollution in the
source systems.
 Also, be aware of all the possible types of data quality
problems likely to be encountered in your operational
systems.
Data Pollution Sources
 System conversions and migrations
 Heterogeneous systems integration
 Inadequate database design of source systems
 Data aging
 Incomplete information from customers
 Input errors
 Internationalization/localization of systems
 Lack of data management policies/procedures
Types of Data Quality Problems
 Dummy values in source system fields
 Absence of data in source system fields
 Multipurpose fields
 Cryptic data
 Contradicting data
 Improper use of name and address lines
 Violation of business rules
 Reused primary keys
 Nonunique identifiers
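A few of the problem types listed above can be detected with a simple profiling pass over source records. The sketch below is illustrative only: the records, the field names, and the set of dummy markers are assumptions, and it covers just three of the listed problems (dummy values, absence of data, and nonunique identifiers).

```python
from collections import Counter

# Hypothetical source records; field names and values are assumptions.
records = [
    {"customer_id": "C001", "phone": "555-0101", "zip": "10001"},
    {"customer_id": "C002", "phone": "999-9999", "zip": ""},
    {"customer_id": "C002", "phone": "555-0133", "zip": "94105"},
]

# Assumed placeholder values that source systems use instead of real data.
DUMMY_VALUES = {"999-9999", "00000", "N/A"}


def profile(rows, key_field):
    """Count dummy values, absent values, and nonunique identifiers."""
    dummy = sum(1 for r in rows for v in r.values() if v in DUMMY_VALUES)
    missing = sum(1 for r in rows for v in r.values() if v in ("", None))
    key_counts = Counter(r[key_field] for r in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    return {"dummy": dummy, "missing": missing, "duplicate_keys": duplicate_keys}


report = profile(records, "customer_id")
print(report)
```

Running such a profile against each source system during requirements definition gives an early, rough picture of where data pollution is likely to come from.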
Metadata.
 Metadata in a data warehouse is much more than details that can
be carried in a data dictionary or data catalog.
 Metadata acts as a glue to tie all the components together.
 When data moves from one component to another, that
movement is governed by the relevant portion of metadata.
 When a user queries the data warehouse, metadata acts as the
information resource to connect the query parameters with the
database components.
Metadata in a data warehouse is categorized into three groups:
operational, data extraction and transformation, and end-user.
Figure 6-5 displays the impact of business requirements on the metadata
architectural component.
Figure 6-5 Impact of requirements on metadata.
DATA STORAGE SPECIFICATIONS
 If your company is adopting the top-down approach, define the
storage specifications for:
 The data staging area
 The overall corporate data warehouse
 Each of the dependent data marts, beginning with the first
 Any multidimensional databases for OLAP
 If your company is adopting the bottom-up approach, you need
specifications for:
 The data staging area
 Each of the conformed data marts, beginning with the first
 Any multidimensional databases for OLAP
DBMS Selection
 Whatever your choice of the database management system may be,
that system will have to interact with back-end and front-end tools.
 The back-end tools are the products for data transformation, data
cleansing, and data loading.
 The front-end tools relate to information delivery to the users.
 If you are trying to find the best tools to suit your environment, the
chances are these tools may not be from the same vendors who
supplied the database products.
 Therefore, one important criterion for the database
management system is that the system must be open. It
must be compatible with the chosen back-end and front-end tools.
 Broadly, the following elements of business requirements affect the
choice of the DBMS:
Level of User Experience. If the users are totally inexperienced with database systems, the DBMS
must have features to monitor and control runaway queries. If many of your users are power users, then
they will be formulating their own queries.
Types of Queries. The DBMS must have a powerful optimizer if most of the queries are complex and
produce large result sets. Alternatively, if there is an even mix of simple and complex queries, there
must be some sort of query management in the database software to balance the query execution.
Data Loads. The data volumes and load frequencies determine the strengths the DBMS needs in the areas of data
loading, recovery, and restart.
Metadata Management. If your metadata component does not have to be elaborate, then a DBMS
with an active data dictionary may be sufficient.
Data Repository Locations. Is your data warehouse going to reside in one central location, or is it
going to be distributed? The answer to this question will establish whether the selected DBMS must
support distributed databases.
Data Warehouse Growth. Your business requirements definition must contain information on the
estimated growth in the number of users, and in the number and complexity of queries.
Storage Sizing
 How big will your data warehouse be? How much storage will be needed for all
the data repositories? What is the total storage size? Answers to these questions
will impact the type and size of the storage medium. How do you find answers to
these questions?
 You need to estimate the storage sizes for the following in the requirements
definition phase:
 Data Staging Area. Calculate storage estimates for the data staging area of the
overall corporate data warehouse from the sizes of the source system data
structures for each business subject.
 Overall Corporate Data Warehouse. Estimate the storage size based on the data
structures for each business subject. You know that data in the data warehouse is
stored by business subjects. For each business subject, list the various attributes,
estimate their field lengths, and arrive at the calculation for the storage needed
for that subject.
 Data Marts—Conformed, Independent, Dependent, or Federated.
While defining requirements, you create information diagrams. A set of
these diagrams constitutes a data mart. Each information diagram
contains business dimensions and their attributes. The information
diagram also holds the metrics or business measurements that are meant
for analysis. Use the details of the business dimensions and business
measures found in the information diagrams to estimate the storage size
for the data marts.
 Multidimensional Databases. These databases support OLAP or
multidimensional analysis.
 How much online analytical processing (OLAP) is necessary for your
users? The corporate data warehouse or the individual conformed or
dependent data mart supplies the data for the multidimensional
databases.
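The sizing approach described above (list the attributes of a business subject, estimate their field lengths, and multiply by the expected number of rows) can be worked through as simple arithmetic. All numbers below are hypothetical: the attribute names, the field lengths in bytes, the row count, and the 50 percent allowance for indexes and other overhead are assumptions chosen only to show the calculation.

```python
# Hypothetical attributes and field lengths (in bytes) for the sales subject.
SALES_FIELD_LENGTHS = {
    "product_key": 4,
    "time_key": 4,
    "location_key": 4,
    "actual_sales": 8,
    "forecast_sales": 8,
    "budget_sales": 8,
}


def estimate_storage_bytes(field_lengths, row_count, overhead_pct=0):
    """Row width times row count, plus a percentage allowance for overhead."""
    row_width = sum(field_lengths.values())
    raw_bytes = row_width * row_count
    return raw_bytes * (100 + overhead_pct) // 100


row_width = sum(SALES_FIELD_LENGTHS.values())
estimate = estimate_storage_bytes(SALES_FIELD_LENGTHS, 1_000_000, overhead_pct=50)

print(f"row width: {row_width} bytes")
print(f"estimated storage: {estimate / 1_000_000:.1f} MB")
```

Repeating the calculation for each business subject and each repository (staging area, corporate data warehouse, data marts) and adding the results gives a first estimate of total storage.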
INFORMATION DELIVERY STRATEGY
 The impact of business requirements on the information delivery
mechanism in a data warehouse is straightforward. During the
requirements definition phase, users tell you what information
they want to retrieve from the data warehouse.
 The broad areas of the information delivery component directly
impacted by business requirements are:
 Queries and reports
 Types of analysis
 Information distribution
 Real-time information delivery
 Decision support applications
 Growth and expansion
Queries and Reports
 Find out who will be using predefined queries and preformatted
reports.
 Get the specifications for the production and distribution
frequency for the reports. How many users will be running the
predefined queries?
 The second type of query is one in which the users formulate their own
queries and run the queries themselves.
 Similarly, there is a set of reports for which the users supply the report
parameters and print fairly sophisticated reports themselves.
 Get as many details of these types of queries and
report sets as you can.
Types of Analysis
 Most data warehouse and business intelligence environments
provide several features to run interactive sessions and perform
complex data analysis. Analysis encompassing drill-down and
roll-up methods is fairly common. Review with your users all the
types of analysis they would like to perform. Get information on
the anticipated complexity of the types of analysis.
 In addition to the analysis performed directly on the data marts,
most of today’s data warehouse and business intelligence
environments equip users with OLAP.
 Using the OLAP facilities, users can perform multidimensional
analysis and obtain multiple views of the data from
multidimensional databases. This type of analysis is called slicing
and dicing.
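Slicing and dicing can be illustrated on a tiny cube held in a dictionary. This is a minimal sketch, not how an OLAP product stores data: the dimension names, the cell values, and the functions `slice_cube` and `dice` are all hypothetical.

```python
# Hypothetical cube: sales units keyed by (product, month, region).
DIMENSIONS = ("product", "month", "region")

cube = {
    ("widget", "Jan", "East"): 10,
    ("widget", "Jan", "West"): 7,
    ("widget", "Feb", "East"): 12,
    ("gadget", "Jan", "East"): 5,
    ("gadget", "Feb", "West"): 9,
}


def slice_cube(cells, dimension, value):
    """Fix one dimension to a single value and drop it from the keys."""
    i = DIMENSIONS.index(dimension)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == value}


def dice(cells, allowed):
    """Keep the cells whose coordinates fall within the allowed values."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[DIMENSIONS.index(d)] in vals for d, vals in allowed.items())
    }


january = slice_cube(cube, "month", "Jan")
east_widgets = dice(cube, {"product": {"widget"}, "region": {"East"}})

print(january)
print(east_widgets)
```

A slice yields a view with one dimension fewer, while a dice keeps all dimensions but restricts each to a subset of its values.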
CHAPTER SUMMARY
 Accurate requirements definition in a data warehouse project is many times more important than in
other types of projects. Clearly understand the impact of business requirements on every
development phase.
 Business requirements condition the outcome of the data design phase.
 Every component of the data warehouse architecture is strongly influenced by the business
requirements.
 In order to provide data quality, identify the data pollution sources, the prevalent types of quality
problems, and the means to eliminate data corruption early in the requirements definition phase
itself.
 Data storage specifications, especially the selection of the DBMS, are determined by business
requirements. Make sure you collect enough relevant details during the requirements phase.
 Business requirements strongly influence the information delivery mechanism.
 Requirements define how, when, and where the users will receive information from the data
warehouse.