DW Concepts - Computer Science

advertisement
Data Warehousing
Concepts
Dr. Awad Khalil
Computer Science Department
AUC
Object_Oriented Databases, by Dr.
Khalil
1
Content
 Why
Data Warehousing?
 What Data Warehousing?
 Benefits of Data Warehousing
 Problems of Data Warehousing
Object_Oriented Databases, by Dr.
Khalil
2
Why Data Warehousing?




Corporate decision-makers require access to all the organization’s
data, wherever it is located.
To provide comprehensive analysis of the organization, its
business, its requirements, and any trends, requires access to not
only the current values in the database, but also to historical data.
To facilitate this type of analysis, the data warehouse has been
created to hold data drawn from several data sources, maintained
by different operating units, together with historical and summary
transformations.
However, decision-makers also require powerful analysis tools.
Two main types of analysis tools have emerged over the last few
years:
 Online Analytical Processing (OLAP)
 Data Mining
Object_Oriented Databases, by Dr.
Khalil
3
What Data Warehousing?

Data Warehousing: A subject-oriented, integrated, timevariant, and non-volatile collection of data in support of
management’s decision-making process.




Subject-oriented: as the warehouse is organized around the
major subjects of the enterprise (such as customers, products,
and sales) rather than the major application areas (such as
customer invoicing, stock control, and product sales).
Integrated: because of the coming together of source data
from different enterprise-wide applications systems.
Time-variant: because data in the warehouse is only accurate
and valid at some point in time or over some time interval.
Non-volatile: as the data is not updated in real time but
refreshed from operational systems on a regular basis.
Object_Oriented Databases, by Dr.
Khalil
4
Benefits of Data Warehousing
Potential
high returns on investment
Competitive advantage
Increase productivity of corporate
decision-makers
Object_Oriented Databases, by Dr.
Khalil
5
OLTP versus Data Warehousing
Data Warehousing
OLTP Systems










Holds current data
Stores detailed data
Data is dynamic
Repetitive processing
High level of transaction
throughput
Predictable pattern of usage
Transaction-driven
Application-oriented
Supports day-to-day
decisions
Serve large number of users

Holds historical data

Stores detailed, lightly, and
highly summarized data

Data is largely static

Ad hoc, unstructured and
heuristic processing

Medium to low level of
transaction throughput

Unpredictable pattern of usage

Analysis-driven

Subject-oriented
Supports strategic decisions
Serves relatively low number
of managerial users


Object_Oriented Databases, by Dr.
Khalil
6










Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long-duration projects
Complexity of integration
Object_Oriented Databases, by Dr.
Khalil
7
Data Warehouse Architecture










Operational Data
Operational Data
Store (ODS)
Load Manager
Warehouse
Manager
Query Manager
Detailed Data
Lightly and Highly
Summarized Data
Archive/Backup
Data
Metadata
End-User Access
Tools
Object_Oriented Databases, by Dr.
Khalil
8
Data Warehouse Architecture (Cont’d)
Operational Data
 Mainframe operational data held in first generation hierarchical
and network databases.
 Departmental data held in proprietary file systems such as
VSAM, RMS, and relational DBMSs such as Oracle and DB2.
 Private data held on workstations and private servers.

External systems such as the Internet, commercially available
databases, or databases associated with an organization’s
suppliers or customers.
Operational Data Store (ODS)
 ODS is a repository of current and integrated operational data
used for analysis.
 It is often structured and supplied with data in the same way as
the data warehouse , but may in fact act simply as a staging area
for data to be moved into the data warehouse.
Object_Oriented Databases, by Dr.
Khalil
9
Data Warehouse Architecture (Cont’d)
Load Manager
 The load manager (also called the frontend component)
performs all the operations associated with the
extraction and loading of data into the warehouse.
 The operations performed by the load manager may
include simple transformations of the data to prepare
the data for entry into the warehouse.
 The size and complexity of this component will vary
between data warehouses and may be constructed using
a combination of vendor data loading tools and custombuilt programs.
Object_Oriented Databases, by Dr.
Khalil
10
Data Warehouse Architecture (Cont’d)
Warehouse Manager
 The warehouse manager performs all the operations associated with the management
of the data in the warehouse.
 This component is constructed using vendor data management tools and custom-built
programs.
 The operations performed by the warehouse manager include:
 Analysis of data to ensure consistency.
 Transformation and merging of source data from temporary storage into data
warehouse tables.
 Creation of indexes and views on base tables.
 Generation of denormalizations (if necessary).
 Generation of aggregations (if necessary).
 Backing-up and archiving data.
Query Manager
 The query manager (also called the backend component) performs all the operations
associated with the management of user queries.
 This component is typically constructed using vendor end-user data access tools, data
warehouse monitoring tools, database facilities, and custom-built programs.
Object_Oriented Databases, by Dr.
Khalil
11
Data Warehouse Architecture (Cont’d)
Detailed Data
 This area of the warehouse stores all the detailed data in the
database schema.
 In most cases, the detailed data is not stored online but is made
available by aggregating the data to the next level of detail.
 However, on a regular basis, detailed data is added to the
warehouse to supplement the aggregated data.
Lightly and Highly Summarized Data
 This area stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
 It is transient area as it will be subject to change on an ongoing
basis in order to respond to changing query profiles.
 The purpose of summary information is to speed up the
performance of the queries.
Object_Oriented Databases, by Dr.
Khalil
12
Data Warehouse Architecture (Cont’d)
Archive/Backup Data
 This area of the warehouse stores the detailed and
summarized data for the purpose of archiving and
backup.
Metadata
 This area stores all the metadata (data about data)
definitions used by all the processes in the warehouse.
Metadata is used for a variety of purposes including:
 The extraction and loading processes.
 The warehouse management process.
 As part of the query management process.
Object_Oriented Databases, by Dr.
Khalil
13
Data Warehouse Architecture (Cont’d)
End-User Access Tools
 The principal purpose of data warehousing is to provide
information to business users for strategic decision-making.
 These users interact with the warehouse using end-user access
tools.
 The data warehouse must efficiently support ad hoc and routine
analysis.
 Although the definition of end-user access tools can overlap,
they can be categorized into five main groups:
 Reporting and query tools;
 Application development tools;
 Executive Information Systems (EIS) tools;
 Online Analytical Processing (OLAP) tools;
 Data mining tools.
Object_Oriented Databases, by Dr.
Khalil
14
Data Warehouse Data Flows

Data warehousing focuses on the
management of five primary data
flows, namely the inflow, upflow,
downflow, outflow, and
metaflow:
 Inflow: Extracting, cleansing,
and loading of the source
data.
 Upflow: Adding value to the
data in the warehouse
through summarizing,
packaging, and distribution
of the data.
 Downflow: Archiving and
backing-up the data in the
warehouse.
 Outflow: Making the data
available to end-users
 Metaflow: Managing the
metadata.
Object_Oriented Databases, by Dr.
Khalil
15
Data Warehouse Tools and Technologies
Extraction, Cleansing, and Transformation Tools
 Selecting the correct extraction, cleansing, and
transformation tools are critical steps in the
construction of a data warehouse.
 The tasks of capturing data from a source system,
cleansing and transforming it, and then loading the
results into target system can be carried out either by
separate products, or by single integrated solution.
 Integrated solutions fall into one of the following
categories:
 Code generators;
 Database replication tools;
 Dynamic transformation engines.
Object_Oriented Databases, by Dr.
Khalil
16
Data Warehouse Tools and Technologies (Cont’d)
Data Warehouse DBMS
 Requirements: The specialized requirements for a relational DBMS suitable for data
warehousing are as follows:
 Load performance and load processing
 Data quality management
 Query performance
 Terabyte scalability and mass user scalability
 Networked data warehouse
 Warehouse administration
 Integrated dimensional analysis
 Advanced query functionality







Parallel DBMSs: Data warehousing requires the processing of enormous amounts of data and
parallel database technology offers a solution to providing the necessary growth in performance.
The success of parallel DBMSs depends on the efficient operation of many resources including
processors, memory, disks and network connections.
The aim behind using parallel DBMS is to solve decision-support problems using multiple nodes
working on the same problem.
The major characteristics of parallel DBMSs are scalability, operability, and availability.
The parallel DBMS performs many database operations simultaneously, splitting individual tasks
into smaller parts so that tasks can be spread across multiple processors.
Parallel DBMSs must be capable of running parallel queries.
Parallel DBMSs must be capable of parallel data loading, table scaling, and data archiving and
backup.
Object_Oriented Databases, by Dr.
Khalil
17
Data Warehouse Tools and Technologies (Cont’d)
Data Warehouse Metadata
The major purpose of metadata is to show the pathway back to where the data begun,
so that the warehouse administrators know the history of any item in the warehouse.
 Metadata has several functions within the warehouse that relates to the processes
associated with data transformation and loading, data warehouse management, and
query generation.
 The metadata associated with data transformation and loading must describe the
source data and any changes that were made to the data.
 The metadata associated with data management describes the data as it is stored in the
warehouse.
 The metadata is also required by the query manager to generate appropriate queries. In
turn, the query manager generates additional metadata about the queries that are run,
which can be used to generate a history on all the queries and a query profile for each
user, group of users, or the data warehouse.
 There is also metadata associated with the users of queries that includes, for example,
information describing what the term ‘price’ or ‘customer’ means in a particular
database and whether the meaning has changed over time.
Synchronizing Metadata
 The major integration issue is how to synchronize the various types of metadata used
throughout the data warehouse.
 The various tools of a data warehouse generate and use their own metadata, and to
achieve integration, we require that these tools are capable of sharing their metadata.
 The challenge is to synchronizeObject_Oriented
metadata between
different
Databases,
by Dr. products from different
18
vendors using different metadata stores. Khalil

Data Warehouse Tools and Technologies (Cont’d)
Administration and Management Tools


A data warehouse requires tools to support the administration and
management of such a complex environment.
The data warehouse administration and management tools must be capable of
supporting the following tasks:
 Monitoring data loading from multiple sources;
 Data quality and integrity checks;
 Managing and updating metadata;
 Monitoring database performance to ensure efficient query response times
and resources utilization;
 Auditing data warehouse usage to provide user chargeback information;
 Replicating, subsetting, and distributing data;
 Maintaining efficient data storage management;
 Purging data;
 Archiving and backing-up data;
 Implementing recovery following failures;
 Security management.
Object_Oriented Databases, by Dr.
Khalil
19
Data Marts


Data Mart is a subset of a data
warehouse that supports the
requirements of a particular
department or business
function.
The characteristics that
differentiate data marts and data
warehouses include:
 A data mart focuses on only
the requirements of users
associated with one
department or business
function;
 Data marts do not normally
contain detailed operational
data, unlike data
warehouse;
 As data marts contain less
data compared with data
warehouse, data marts are
more easily understood and
navigated.
Object_Oriented Databases, by Dr.
Khalil
20
Reasons for Creating a Data Mart







To give users access to the data they need to analyze most often;
To provide data in a form that matches the collective view of the data by a
group of users in a department or business function;
To improve end-user response time due to the reduction in the volume of
data to be accessed;
To provide appropriately structured data as dictated by the requirements of
end-user access tools such as OLAP and data mining tools, which may
require their own internal database structures.
Data marts normally use less data so tasks such as data cleansing, loading,
transformation, and integration are far easier, and hence implementing and
setting up a data mart is simpler than establishing a corporate data
warehouse.
The cost of implementing data marts is normally less than that required to
establish a data warehouse.
The potential users of a data mart are more clearly defined and can be more
easily targeted to obtain support for a data mart project rather than a
corporate data warehouse project.
Object_Oriented Databases, by Dr.
Khalil
21
Data Marts Issues

Data mart functionality: The capabilities of data marts have increased with the growth in
their popularity. Rather than being simply small, easy-to-access databases, some data marts
must now be scalable to hundreds of gigabytes, and provide sophisticated analysis using OLAP
and/or data mining tools.

Data mart size: Users expect faster response times from data marts than from data
warehouse, however, performance deteriorates as data marts grow in size.

Data mart load performance: A data mart has to balance two critical components: enduser response time and data loading performance.

User’s access to data in multiple data marts: One approach is to replicate data
between different data marts or, alternatively, build virtual data marts. Virtual data marts are
views of several physical data marts or the corporate data warehouse tailored to meet the
requirements of specific group of users.

Data mart Internet/Intranet access: Internet/Intranet technology offers users low-cost
access to data marts and the data warehouse using Web browsers.

Data mart administration: As the number of data marts in an organization increases, so
does the need to centrally manage and coordinate data mart activities.

Data mart installation: Data marts are becoming increasingly complex to build. Vendors
are offering products referred to as “Data marts in a box” that provide a low-cost source of data
mart tools.
Object_Oriented Databases, by Dr.
Khalil
22
Thank you
Object_Oriented Databases, by Dr.
Khalil
23
Download