Uploaded by Faiza Saeed

DW-lecture-1-Introduction-to-DWH-13102021-031207pm

1
Course Objectives
The main objective of this course is to provide students
with key knowledge of data warehousing.
Other objectives are as follows:
• Understand the concepts and principles of data
warehousing.
• Reconcile different views of the same data.
• Evaluate the fundamental theories and requirements
that influence the design of modern data warehouses.
• Critically evaluate alternative designs and architectures
for data warehouses.
• Analyze the background processes involved in queries
and transactions, and explain how these impact
database operation and design.
2
Course Learning Outcomes
Upon completion of the course, students will be able
to:
• Explain accepted data warehouse terminology
• Explain the goals of data warehousing
• Identify the stages of the data warehousing
lifecycle
• Apply the star schema model to a business case
problem
• Denormalize relational tables into high-level
summary tables
• Design and implement a multi-dimensional data
cube using SQL Server Analysis Services
3
About Theory Course
Course Code: CSC-454
Course Title: Data Warehousing
Credit Hours: 3
Abbreviation: DWH
Prerequisite: Database Management System (CSC 220)
Type of Course: Elective
Course Description:
Introduction to Data Warehousing, Normalization, De-Normalization,
De-Normalization Techniques, Issues of De-Normalization, Online
Analytical Processing (OLAP), Extract Transform Load (ETL), Data
Cleansing, Data Quality Management (DQM), DWH Lifecycle
4
Course Assessment
Quizzes: 10%
Assignments (Theoretical): 20%
Midterm Examination: 20%
Final Examination: 50%
Total: 100%
5
Books
• “Data Warehousing Fundamentals for
IT Professionals”, 2nd ed., by Paulraj
Ponniah, 2010
• “The Data Warehouse Toolkit”, latest
edition, by Ralph Kimball and Margy
Ross, 2013
• “Mastering Data Warehouse
Design”, latest edition, by Claudia
Imhoff
6
Reference Books
• “Building the Data Warehouse” by
William H. Inmon
• “The Unified Star Schema: An Agile and
Resilient Approach to Data Warehouse
and Analytics Design” by Bill Inmon and
Francesco, 2020
• “The Modern Data Warehouse in Azure:
Building with Speed and Agility on
Microsoft’s Cloud Platform” by Matt
How, 2020
7
Today’s Outline
Why DWH?
What is DWH?
History
Inmon vs. Kimball data cycle
Characteristics of DWH
Traditional DWH
Problems of Data warehousing
8
Introduction to Data
Warehousing
9
10
Data, data everywhere yet…
11
Why study Data warehouse?
• New technologies (multidimensional modeling,
business intelligence, OLAP, querying models, etc)
• Research potential (data mining, business
intelligence, ETL algorithms, multidimensional
data analysis, query optimizations, etc)
• Industry demand
• High market value of DW experts
• Fulfill degree requirements
• Easy course, want to sleep through it
12
What is a Data Warehouse?
• A single, complete and
consistent store of data
obtained from a variety of
different sources and made
available to end users in a way
they can understand and use in
a business context.
13
What is a Data Warehouse?
• A process of transforming
data into information and
making it available to users
in a timely enough manner
to make a difference
14
What is a Data Warehouse?
• A Data Warehouse is a relational database
that is designed for query and analysis rather
than for transaction processing.
• It usually contains historical data derived from
transaction data, but it can include data from
other sources.
• It separates analysis workload from
transaction workload and enables an
organization to consolidate data from several
sources.
15
What is DWH?
• A collection of small databases.
• A central location where consolidated data from multiple
locations (databases) is stored.
• Strategic information/data is saved in the DWH.
– E.g., total sales, popular products, unpopular products.
• Note: the data warehouse is not loaded every time new data is
added to a database.
16
Data Warehouse
• In addition to a relational database, a data
warehouse environment includes:
1. an Extraction, Transformation, and Loading (ETL)
solution.
2. an Online Analytical Processing (OLAP) engine.
3. client analysis tools.
4. other applications that manage the process of
gathering data and delivering it to business users.
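The ETL component listed above can be sketched in a few lines. This is only a minimal illustration, not a production pipeline: the source rows, the `sales_fact` table, and the function names `extract`, `transform` and `load` are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational (OLTP) system.
SOURCE_ROWS = [
    {"order_id": 1, "product": " Laptop ", "amount": "1200.50"},
    {"order_id": 2, "product": "mouse", "amount": "25"},
    {"order_id": 3, "product": "Mouse", "amount": None},  # incomplete record
]


def extract():
    """Extract: read the raw rows from the (assumed) source system."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: drop incomplete rows, tidy product names, convert amounts to floats."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # skip rows with no amount
        cleaned.append(
            (row["order_id"], row["product"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT product, amount FROM sales_fact ORDER BY order_id"
).fetchall()
print(loaded)
```

Real ETL tools add scheduling, logging and error handling around the same three steps.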
17
History of Data Warehousing
18
History of Data Warehousing
19
Definition by founders of DWH
Bill Inmon
• “A data warehouse is a subject-oriented,
integrated, time-variant
and non-volatile collection
of data in support of
management’s decision
making process.”
Ralph Kimball
• “A data warehouse is a copy of transaction data
specifically structured for
query and analysis.”
• Dimensional modelling
focuses on ease of end-user
accessibility.
20
21
22
Inmon vs. Kimball
23
Parameter | Kimball | Inmon
Introduced by | Ralph Kimball. | Bill Inmon.
Approach | Bottom-up approach to implementation. | Top-down approach to implementation.
Data Integration | Focuses on individual business areas. | Focuses on enterprise-wide areas.
Building Time | Efficient; takes less time. | Complex; consumes a lot of time.
Cost | Has iterative steps and is cost-effective. | Initial cost is huge and development cost is low.
Skills Required | Does not need specialized skills; a generalist team can do the job. | Needs specialized skills to work.
Maintenance | Maintenance is difficult. | Maintenance is easy.
Data Model | Prefers data in a denormalized model. | Prefers data in a normalized model.
Data Store Systems | Source systems are highly stable. | Source systems have a high rate of change.
24
Characteristics of Data Warehousing
• We refer to Inmon's definition (1993) for the
characteristics:
“A data warehouse is a Subject-Oriented, Integrated,
Time-Variant and Non-Volatile collection of data in
support of management’s decision-making process.”
• So the characteristics of a good data warehouse are:
1. Subject-Oriented (detailed information about a particular subject)
2. Integrated (data from different source systems)
3. Time-Variant (historical data, change over time)
4. Non-Volatile (unchangeable)
25
1. Subject Oriented:
• A data warehouse is organized around the major subjects of the
enterprise, i.e. specific business areas (e.g. customers, products,
sales), rather than major application areas (e.g. customer
invoicing, stock control, product sales).
• This is reflected in the need to store decision-support data rather
than application-oriented data.
• Example:
– To learn more about your company’s sales data, you can build a warehouse
that concentrates on SALES.
– Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?"
– This ability to define a data warehouse by subject matter, SALES in this
case, makes the data warehouse Subject Oriented.
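The SALES example above can be sketched as a single query over a subject-oriented table. The `sales` table, its columns, the sample rows and the helper `best_customer` are hypothetical, and a fixed year is passed in place of "last year" to keep the result reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical SALES subject area: one row per sale.
conn.execute(
    "CREATE TABLE sales (customer TEXT, item TEXT, sale_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Acme", "Widget", 2020, 500.0),
        ("Acme", "Widget", 2020, 300.0),
        ("Bolt", "Widget", 2020, 700.0),
        ("Bolt", "Gadget", 2020, 900.0),
        ("Bolt", "Widget", 2019, 2000.0),
    ],
)


def best_customer(conn, item, year):
    """Return (customer, total) with the highest total sales of `item` in `year`."""
    return conn.execute(
        """
        SELECT customer, SUM(amount) AS total
        FROM sales
        WHERE item = ? AND sale_year = ?
        GROUP BY customer
        ORDER BY total DESC
        LIMIT 1
        """,
        (item, year),
    ).fetchone()


result = best_customer(conn, "Widget", 2020)
print(result)
```

Because the table is organized around the subject (sales) rather than around an application such as invoicing, the question maps directly onto one GROUP BY.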
26
2. Integrated Data:
• The data warehouse integrates corporate
application-oriented data from different
source systems, which often includes data that
is inconsistent.
• The integrated data source must be made
consistent to present a unified view of the
data to the users.
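A minimal sketch of what "made consistent" can mean in practice, assuming two hypothetical source systems (a CRM and a billing system) that encode customer IDs, gender and country differently. All names, sample rows and mapping tables here are invented for illustration.

```python
# Hypothetical extracts from two source systems that encode the same facts differently.
CRM_ROWS = [{"cust_id": "C-001", "gender": "M", "country": "pk"}]
BILLING_ROWS = [{"customer": 2, "sex": "female", "country": "Pakistan"}]

# Assumed mapping tables agreed on for the warehouse.
GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
COUNTRY_MAP = {"pk": "Pakistan", "pakistan": "Pakistan"}


def from_crm(row):
    """Convert a CRM row to the unified warehouse layout."""
    return {
        "customer_id": int(row["cust_id"].split("-")[1]),  # "C-001" -> 1
        "gender": GENDER_MAP[row["gender"].lower()],
        "country": COUNTRY_MAP[row["country"].lower()],
    }


def from_billing(row):
    """Convert a billing row to the unified warehouse layout."""
    return {
        "customer_id": row["customer"],
        "gender": GENDER_MAP[row["sex"].lower()],
        "country": COUNTRY_MAP[row["country"].lower()],
    }


# One unified view, regardless of which system the row came from.
unified = [from_crm(r) for r in CRM_ROWS] + [from_billing(r) for r in BILLING_ROWS]
for row in unified:
    print(row)
```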
27
3. Nonvolatile:
• Nonvolatile means that, once entered into the
warehouse, data should not change.
• Data in the warehouse is not updated in real-time but
is refreshed from operational systems on a regular
basis.
• New data is always added as a supplement to the
database, rather than a replacement.
• This is logical because the purpose of a warehouse is
to enable you to analyze what has occurred.
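The "supplement rather than replacement" idea can be sketched as an append-only refresh. The `price_history` table, the `refresh` helper and the sample prices are hypothetical; the point is only that each periodic load inserts new rows and never updates or deletes old ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_history (product TEXT, price REAL, load_date TEXT)")


def refresh(conn, snapshot, load_date):
    """Append the current operational snapshot; existing rows are never changed."""
    conn.executemany(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        [(product, price, load_date) for product, price in snapshot.items()],
    )
    conn.commit()


# Two periodic refreshes; the price changed in the source system in between.
refresh(conn, {"Widget": 10.0}, "2021-01-01")
refresh(conn, {"Widget": 12.0}, "2021-02-01")

history = conn.execute(
    "SELECT load_date, price FROM price_history "
    "WHERE product = 'Widget' ORDER BY load_date"
).fetchall()
print(history)
```

Both prices remain available, so an analyst can still see what the value was at each load.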
28
4. Time Variant:
• In order to discover trends in business, analysts
need large amounts of data.
• This is very much in contrast to Online
Transaction Processing (OLTP) systems, where
performance requirements demand that
historical data be moved to an archive.
• A data warehouse’s focus on change over time
is what is meant by the term time variant
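As a small illustration of why history matters, the sketch below computes yearly totals and year-over-year changes from dated records. The sample data and the function names `yearly_totals` and `year_over_year` are hypothetical.

```python
from collections import defaultdict

# Hypothetical historical sales records: (ISO date, amount).
SALES = [
    ("2019-03-10", 100.0),
    ("2019-11-02", 150.0),
    ("2020-06-21", 300.0),
    ("2021-01-15", 200.0),
    ("2021-09-30", 250.0),
]


def yearly_totals(sales):
    """Sum the amounts per calendar year, returned in ascending year order."""
    totals = defaultdict(float)
    for date, amount in sales:
        totals[int(date[:4])] += amount
    return dict(sorted(totals.items()))


def year_over_year(totals):
    """Return the change in total sales from each year to the next."""
    years = list(totals)
    return {curr: totals[curr] - totals[prev] for prev, curr in zip(years, years[1:])}


totals = yearly_totals(SALES)
changes = year_over_year(totals)
print(totals)
print(changes)
```

An OLTP system that had archived the 2019 and 2020 rows could not answer this question without going back to the archive.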
29
Traditional DWH
• Central repository for all internal data
in a company
• Overall relational schema
• Predictable data structure and
quality optimize processing and
reporting
• Data is stored in disk-block format
• The fundamental operation is reading a row
• Indexing via B-tree
30
Key Trends Breaking traditional DWH
• Big data technology enables IT to leverage multiple sources of data.
Following are some of the sources
31
Problems of Data Warehousing
1. Underestimation of resources for data loading:
– Many developers underestimate the time required to extract,
clean and load the data into the warehouse.
– This process may account for a significant proportion of the
total development time, although better data cleansing and
management tools should ultimately reduce the time and
effort spent.
2. Hidden problems with source systems:
– Hidden problems associated with the source systems feeding
the data warehouse may be identified, possibly after years of
being undetected.
– The developer must decide whether to fix the problem in the
data warehouse and/or fix the source systems.
• For example, when entering the details of a new property, certain
fields may allow null, which may result in staff entering incomplete
property data, even when available and applicable.
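Hidden gaps like the nullable property fields above can often be surfaced with a simple profiling pass before loading. This is a minimal sketch: the property records and the `null_rates` helper are hypothetical.

```python
# Hypothetical property records from a source system whose fields allow NULL.
PROPERTIES = [
    {"property_no": "P1", "rooms": 6, "rent": 650, "postcode": "AB1"},
    {"property_no": "P2", "rooms": None, "rent": 400, "postcode": None},
    {"property_no": "P3", "rooms": 3, "rent": None, "postcode": None},
    {"property_no": "P4", "rooms": 4, "rent": 500, "postcode": "CD2"},
]


def null_rates(rows):
    """Return the fraction of missing (None) values for each field."""
    fields = rows[0].keys()
    return {f: sum(r[f] is None for r in rows) / len(rows) for f in fields}


rates = null_rates(PROPERTIES)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A high rate on a field that should always be known suggests a source-system problem rather than genuinely absent data.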
32
Problems of Data Warehousing
3. Required data not captured:
– Data Warehouse projects often highlight a requirement for data
not being captured by the existing source systems.
– The organization must decide whether to modify the OLTP
systems or create a system dedicated to capturing the missing data.
4. Increased end-user demands:
– After end users receive query and reporting tools, requests for
support from IS (Information Systems) staff may increase rather
than decrease. This is caused by users' increasing awareness of
the capabilities and value of the data warehouse.
– This problem can be partially resolved by investing in easier-to-use,
more powerful tools, or by providing better training for the
users.
33
Problems of Data Warehousing
5. High demand for resources:
– A data warehouse can use a large amount of disk space.
– Many relational databases used for decision support are
designed around star, snowflake and starflake schemas.
These approaches result in the creation of very large fact tables.
6. High maintenance:
– Data warehouses are high-maintenance systems.
– Any reorganization of the business processes and source
systems may affect the data warehouse.
– To remain a valuable resource, a data warehouse must remain
consistent with the organization that it supports.
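To get a feel for why the fact tables mentioned under problem 5 grow so large, here is a rough back-of-the-envelope estimate. Every number (three years of history, 100 stores, 2,000 products sold per store per day, 40 bytes per row) is an invented assumption, and `estimate_fact_table` is a hypothetical helper.

```python
def estimate_fact_table(days, stores, products_per_store_per_day, bytes_per_row):
    """Estimate row count and raw size of a daily sales fact table.

    Assumes one fact row per (day, store, product sold); indexes and
    storage overhead are ignored.
    """
    rows = days * stores * products_per_store_per_day
    return rows, rows * bytes_per_row


rows, size_bytes = estimate_fact_table(
    days=3 * 365,                     # assumed: three years of history
    stores=100,                       # assumed number of stores
    products_per_store_per_day=2000,  # assumed distinct products sold daily
    bytes_per_row=40,                 # assumed average row width
)
print(f"{rows:,} rows, about {size_bytes / 1e9:.2f} GB")
```

Even with these modest assumptions the fact table reaches hundreds of millions of rows, before indexes or summary tables are added.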
34
Problems of Data Warehousing
7. Long duration projects:
– A data warehouse represents a single data resource for the organization.
– However, building a warehouse can take up to three years, which is
why some organizations build data marts instead.
– Data marts support only the requirements of a particular department or
functional area and can therefore be built more rapidly.
8. Complexity of integration:
– The most important area for the management of the data warehouse is the
integration capabilities.
– This means an organization must spend a significant amount of time
determining how well the various different data warehousing tools can be
integrated into the overall solution that is needed.
35
Modern Data Warehousing
36
37