Data warehouse, data mining and OLAP

Core of Business “Intelligence” technology:
data warehouse, data mining and on-line analytical processing
Business Intelligence and Analytics for Decision Support
The diagram shows the role played by the data warehouse, data mining and OLAP in the overall business decision-making process.
Business intelligence and analytics require a strong database foundation, a set of analytic tools, and an involved management team that can ask intelligent questions and analyze data. (Laudon and Laudon, Chapter 10)
The Data Warehouse
“A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile collection
of “all” an organisation’s data in support of
management’s decision making process.”
– Data warehouses developed because some questions need data from several sources, e.g.:
– If you want to ask “How much does this customer owe?” then the sales database is probably the one to use. However, if you want to ask “Was this ad campaign more successful than that one?”, you require data from more disparate sources, e.g. production, marketing etc.
Characteristics of a Data Warehouse
• Subject oriented – based around business processes, e.g. sale of products
• Integrated – inconsistencies removed
• Nonvolatile – stored in read-only format
• Time variant – data is “static” and updated periodically
• Summarized – held in a decision-usable format, e.g. monthly averages
• Large volume – data sets are quite large: all the pertinent data of an organisation
• Non normalized – often redundant: star (or starflake) schema (it has dimension tables and fact tables):
The Atomic Schema
• Customer: Customer ID, Status Date, Cust Addr State, Cust ZIP Code, Customer Type, Customer Status, ...
• Cust Purchases: Customer ID, Activity Date, Product Code, Product Name, Sales Rep ID, Qty Purchased, Total Dollars, Promotion Flag
• Product Ref: Product Code, ProdRef Eff. Date, ProdRef End Date, Product Name, Unit Price, Product Category, Product Type, Product Sub Type
• Cust Averages: Customer ID, Cust Average Date, Cust Avg. End Date, Cust Avg. Rev., Cust Longevity
• Outlet Reference: Store ID, Store Name, Store Location, Distribution Channel
• Sales Rep Ref: Sales Rep ID, Sales Person Name, Store ID
For Example:
• Purchases 1 (fact table): Sales Rep ID, Product Code, Cust ZIP Code, Customer Type, Week Ending Date, Days of Activity, Unit Price, Total Quantity, Total Dollars, Returned Qty, Returned Dollars, Promotion Qty
• Selling Responsibility: Sales Rep ID, Sales Rep Name, Store ID, Store Name, Store Location, Sales Channel
• Product: Product Code, Product Name, Prod. Category, Product Type, Prod Sub Type
• Customer Location: Cust ZIP Code, City, State/Province, Country
• Customer Type: Customer Type, Cust Type Desc
• Date Information: Week Ending Date, Month, Quarter, Year
Elements of a data warehousing infrastructure
[Figure: A data warehouse process model. Components shown: external data, operational database(s), ETL routine (Extract/Transform/Load), data warehouse, extract/summarize data, dependent data mart, independent data mart, decision support system, report.]
Meta Data
• A key concept behind the D.W. is meta data.
– Meta data is data about the data (which has come from the data sources). It shows what data is contained in the DW, where it came from, and what changes have been made to it.
• Meta data is an essential ingredient in the transformation of raw data into knowledge. It is the “key” that allows us to handle the raw data.
– For example, a line in a sales database may contain:
1023 K596 111.21
– This is mostly meaningless until we consult the meta data (in the data directory), which tells us it was store number 1023, product K596 and sales of $111.21.
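The role of meta data in the sales example can be sketched in a few lines of Python. This is only an illustration: the field names, the `SALES_METADATA` layout and the `decode_record` helper are assumptions made for the sketch, not part of any real data directory.

```python
# Hypothetical meta data (data directory entry) describing the sales record layout.
SALES_METADATA = [
    {"field": "store_number", "type": int, "description": "Store number"},
    {"field": "product_code", "type": str, "description": "Product code"},
    {"field": "sales_amount", "type": float, "description": "Sales in dollars"},
]


def decode_record(line, metadata):
    """Attach field names and types from the meta data to a raw record."""
    values = line.split()
    if len(values) != len(metadata):
        raise ValueError("record does not match the meta data layout")
    return {m["field"]: m["type"](v) for m, v in zip(metadata, values)}


record = decode_record("1023 K596 111.21", SALES_METADATA)
print(record)
```

Without the meta data the three values are just strings; with it, each value gains a name, a type and a description.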
Meta Data Answers Questions for Users of the Data Warehouse
• How do I find the data I need?
• What is the original source of the data?
• How was this summarization created?
• What queries are available to access the data?
• How have business definitions and terms changed over time?
• How do product lines vary across organizations?
• What business assumptions have been made?
Dependent Data marts
• A data mart is a data store that is subsidiary to a data warehouse of integrated data.
• The data mart is directed at a partition of data (a subject area) that is created for the use of a dedicated group of users; it is sometimes termed a “subject warehouse”.
• The data mart might be a set of denormalised, summarised or aggregated data that can be placed on the data warehouse database or, more often, on a separate physical store.
• Data marts are “dependent data marts” when the data is sourced from the data warehouse.
• Independent data marts represent fragmented solutions to a range of business problems in the enterprise. They are generally discouraged because they lack the data integration that is associated with data warehouses.
Independent Data marts
• However, independent data marts are not necessarily all bad.
• They are often a valid solution to a pressing business problem:
– Extremely urgent user requirements
– The absence of a budget for a full data warehouse
– The decentralisation of business units
Data Warehousing Architecture
• Access Tools
– The principal purpose of the data warehouse is
to provide information for strategic decision
making.
– The main Decision tools used to achieve this
objective are:
• Data mining tools
• On-line analytical processing tools
• Decision support systems / Executive information
system tools
Data Warehousing Typology
– Central data warehouse: the D.W. is held at a single location.
– Distributed data warehouse: the collection of data is replicated around multiple locations, so users have a local copy of the data warehouse. This can improve query run-times and reduce communications overheads. (Note: the principles associated with distributed databases apply equally to distributed data warehouses; however, the static nature of the data needs to be factored in to the design process.)
Data Warehouse Construction
Tips
• Accept that your first try will require revision
• Examine the data: What formats and specific data are
needed to support your application?
• Clean up the data before using it in the warehouse
• Build a prototype mini-data warehouse as a learning
experience and revise strategies as necessary
• Plan on more users than anticipated wanting to use the
warehouse
• Keep storage requirements constantly in mind
Sample type question
• Discuss how the D.W. can play a key role in strategic decision making.
Data Mining
• The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and using
it to make crucial business decisions.
• Involves the analysis of data and the use of
software techniques for finding hidden and
unexpected patterns and relationships in sets
of data.
Data Mining
• Data mining tools use, e.g., AI techniques to help:
– Predict future trends
– Segment datasets
– Find “product” associations
• This allows businesses to make proactive, knowledge-driven decisions.
Data mining: A.I. techniques
• The most commonly used A.I. techniques in data mining are:
– Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
– Nearest neighbour method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbour technique; a similarity-based classification technique.
– Rule induction: The extraction of useful if-then rules from data based on statistical significance.
– Artificial neural networks: Predictive models that learn through training and resemble biological neural networks in structure.
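The nearest neighbour method can be sketched as follows. This is a minimal illustration, assuming numeric features, squared Euclidean distance and a simple majority vote; the historical records are made up for the example.

```python
from collections import Counter


def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k most similar historical records.

    `history` is a list of (features, label) pairs; features are numeric tuples.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(history, key=lambda rec: squared_distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up historical records: (age, income in thousands) -> responded to an offer?
history = [
    ((25, 22), "No"), ((29, 28), "No"), ((31, 30), "No"),
    ((40, 45), "Yes"), ((42, 52), "Yes"), ((45, 58), "Yes"),
]
prediction = knn_classify(history, (41, 50), k=3)
print(prediction)
```

In practice the features would usually be scaled first, so that a wide-ranging attribute such as income does not dominate the distance.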
How Data Mining Works
• For example, say that you are the director of marketing for an insurance company and you'd like to acquire some new customers.
– You could just randomly mail coupons to the general population. However, you would be unlikely to achieve the required result.
– Alternatively, as the marketing director you have access to a lot of information about all of your customers: their age, sex, income range and credit card insurance.
How Data Mining Works
                                                        Customers   Prospects
General information (e.g. demographic data)             Known       Known
Proprietary information (e.g. customer transactions)    Known       Target
• The goal in prospecting is to make some decisions about the information in the lower right-hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information.
An Algorithm for Building Decision Trees
The following is a decision tree algorithm:
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
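The four steps can be sketched as a short recursive function. The slides do not say how "best differentiates" is measured, so the `score` function below is an assumption: it counts how many instances a majority vote in each branch would classify correctly. The stopping criterion (a pure subclass or no attributes left) and the toy training data are likewise illustrative.

```python
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def split(instances, attribute):
    """Step 3: group the instances by their value for `attribute`."""
    groups = {}
    for row in instances:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def score(instances, attribute, target):
    """Assumed goodness measure: instances a per-branch majority vote gets right."""
    return sum(
        Counter(r[target] for r in rows).most_common(1)[0][1]
        for rows in split(instances, attribute).values()
    )


def build_tree(instances, attributes, target):
    labels = [r[target] for r in instances]
    # Step 4, first case: the subclass is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)
    # Step 2: choose the attribute that best differentiates the instances.
    best = max(attributes, key=lambda a: score(instances, a, target))
    remaining = [a for a in attributes if a != best]
    # Step 3 and step 4, second case: one child per value, then recurse.
    return {best: {value: build_tree(rows, remaining, target)
                   for value, rows in split(instances, best).items()}}


def classify(tree, row):
    """Follow the decision path for `row` (values unseen in training raise KeyError)."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree


# Toy training set T (made up for the sketch).
training = [
    {"sex": "Female", "cc_ins": "No", "promo": "Yes"},
    {"sex": "Female", "cc_ins": "Yes", "promo": "Yes"},
    {"sex": "Male", "cc_ins": "Yes", "promo": "Yes"},
    {"sex": "Male", "cc_ins": "No", "promo": "No"},
    {"sex": "Male", "cc_ins": "No", "promo": "No"},
    {"sex": "Female", "cc_ins": "No", "promo": "Yes"},
]
tree = build_tree(training, ["sex", "cc_ins"], "promo")
print(tree)
```

Real decision tree learners typically use measures such as information gain or the Gini index in place of this simple score.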
Table 3.1 • The Credit Card Promotion Database
Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19
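[Figure: decision tree with root node Income Range; life insurance promotion outcomes per branch]
– 20-30K: 2 Yes, 2 No
– 30-40K: 4 Yes, 1 No
– 40-50K: 1 Yes, 3 No
– 50-60K: 2 Yes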
How Data Mining Works
• For instance, a simple model for an insurance company might be:
– Customers who earn between 50K and 60K have a life insurance policy.
• This model could then be applied to the general population to target those for the life insurance promotion.
• The tree can be more complex, e.g. the tree in the figure:
[Figure: decision tree for the life insurance promotion]
Age
– > 43: No (3/0)
– <= 43: Sex
   – Female: Yes (6/0)
   – Male: Credit Card Insurance
      – No: No (4/1)
      – Yes: Yes (2/0)
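The counts quoted for the Income Range split and the leaf figures of the Age / Sex / Credit Card Insurance tree can be checked against Table 3.1 with a short script. The rows are copied from the table; the function names are chosen for this sketch only.

```python
from collections import Counter

# Table 3.1 rows:
# (income range, life insurance promotion, credit card insurance, sex, age)
ROWS = [
    ("40-50K", "No", "No", "Male", 45),
    ("30-40K", "Yes", "No", "Female", 40),
    ("40-50K", "No", "No", "Male", 42),
    ("30-40K", "Yes", "Yes", "Male", 43),
    ("50-60K", "Yes", "No", "Female", 38),
    ("20-30K", "No", "No", "Female", 55),
    ("30-40K", "Yes", "Yes", "Male", 35),
    ("20-30K", "No", "No", "Male", 27),
    ("30-40K", "No", "No", "Male", 43),
    ("30-40K", "Yes", "No", "Female", 41),
    ("40-50K", "Yes", "No", "Female", 43),
    ("20-30K", "Yes", "No", "Male", 29),
    ("50-60K", "Yes", "No", "Female", 39),
    ("40-50K", "No", "No", "Male", 55),
    ("20-30K", "Yes", "Yes", "Female", 19),
]


def counts_by_income(rows):
    """Count Yes/No life insurance promotion outcomes per income range."""
    counts = {}
    for income, promo, _cc_ins, _sex, _age in rows:
        counts.setdefault(income, Counter())[promo] += 1
    return counts


def tree_predict(cc_ins, sex, age):
    """The Age / Sex / Credit Card Insurance tree from the slide."""
    if age > 43:
        return "No"
    if sex == "Female":
        return "Yes"
    return "Yes" if cc_ins == "Yes" else "No"


income_counts = counts_by_income(ROWS)
correct = sum(
    tree_predict(cc_ins, sex, age) == promo
    for _income, promo, cc_ins, sex, age in ROWS
)
print(income_counts)
print(f"{correct} of {len(ROWS)} rows classified correctly")
```

The single misclassified row is the 29-year-old male without credit card insurance, which is the "1" in the No (4/1) leaf.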
Data Mining Operations
• Data mining operations include:
– Predictive modelling: decision trees, regression
analysis…
– Database segmentation: clustering techniques
– Link analysis: decision trees, association rules
Predictive Modeling
• Uses observations to form a model of the important characteristics of some phenomenon, e.g. those traits associated with those who will buy property.
• Applications of predictive modelling include direct marketing; it uses techniques like decision trees.
[Figure: Simple decision tree example]
Database Segmentation
• The aim is to partition a database into an unknown number of segments, or clusters, of similar records.
• Uses clustering techniques in order to group data.
• Applications of database segmentation include credit card fraud detection.
[Figure: Database segmentation using a scatterplot]
Link Analysis
• Aims to establish links between records,
or sets of records, in a database; one such
example would be association
discovery….
• Applications include product affinity
analysis.
• Finds items that imply the presence of
other items in the same event.
Link Analysis - Associations Discovery
• Affinities between items are represented by association discovery.
– e.g. ‘When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases the customer will buy a property. This association happens in 35% of all customers who rent properties.’
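The two percentages quoted in a rule like this are usually called confidence and support. A minimal sketch, assuming the standard definitions (support = share of all records where both sides of the rule hold; confidence = share of records matching the "when" part that also match the "then" part) and a made-up set of renter records:

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    matching = [r for r in records if antecedent(r)]
    both = [r for r in matching if consequent(r)]
    support = len(both) / len(records)
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


# Made-up renter records: (years renting, age, bought a property?)
renters = [
    (3, 30, True), (4, 41, True), (5, 28, False), (3, 35, False), (6, 50, False),
    (1, 22, False), (2, 45, False), (1, 31, True), (4, 24, False), (2, 27, False),
]
support, confidence = rule_stats(
    renters,
    antecedent=lambda r: r[0] > 2 and r[1] > 25,  # rents > 2 years and age > 25
    consequent=lambda r: r[2],                    # buys a property
)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

With this sample, 5 of the 10 renters match the "when" part and 2 of those 5 buy, so the rule holds in 40% of matching cases.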
Examples of Applications of Data Mining
• Retail / Marketing
– Predicting response to mailing campaigns
– Market basket analysis
• Banking:
– Detecting patterns of fraudulent credit card use.
• Insurance
– Claims analysis
• Medicine
– Identifying successful medical therapies for different
illnesses
Data mining in conclusion
• Two critical factors for success with data
mining are:
– a large, well-integrated data warehouse and
– a well-defined understanding of the business
process within which data mining is to be
applied (e.g. customer prospecting (target
marketing), retention, campaign management
etc.).
Sample type question
• Discuss, using suitable examples, how data mining can help companies make proactive, knowledge-driven decisions that could help with the formulation of a company's strategy.
What is OLAP
• OLAP stands for "On-Line Analytical Processing"; it is usually contrasted with OLTP ("On-Line Transaction Processing").
• OLAP describes a class of technologies that are designed for live ad hoc data access and analysis.
• OLTP generally relies solely on relational databases.
• OLAP has become synonymous with multidimensional views of business data supported by multidimensional databases.
• Relational databases were never intended to provide data synthesis, analysis and consolidation functionality.
What is OLAP
• OLTP databases are optimised for transaction updating. OLAP applications, however, are used by managers and analysts for a higher-level aggregate view of the data, and are thus designed for analysis.
• Many problems that people try to solve using relational databases, e.g. summaries, are handled much more efficiently by an OLAP server than by an RDBMS.
Key OLAP Features
OLAP applications are found in widely divergent functional areas, as illustrated in the table opposite. However, they all have the following key features:
1. Multi-dimensional views of data (MD databases via star schema)
2. Support for complex calculations
3. Time intelligence
Star Schema: basis of MD view
[Figure: A star schema for credit card purchases]

Fact Table
Cardholder Key   Purchase Key   Location Key   Time Key   Amount
1                2              1              10         14.50
15               4              5              11         8.25
1                2              3              10         22.40
...

Purchase Dimension
Purchase Key   Category
1              Supermarket
2              Travel & Entertainment
3              Auto & Vehicle
4              Retail
5              Restaurant
6              Miscellaneous
...

Time Dimension
Time Key   Month   Day   Quarter   Year
10         Jan     5     1         2002
...

Cardholder Dimension
Cardholder Key   Name         Gender   Income Range
1                John Doe     Male     50 - 70,000
2                Sara Smith   Female   70 - 90,000
...

Location Dimension
Location Key   Street          City         State   Region
10             425 Church St   Charleston   SC      3
...
Multi-dimensional view as a cube: also represented as a 4 column table
• Example of a three-dimensional query:
• What is the total amount and number of purchases for vehicles in region 2 for December?
[Figure: Multidimensional cube for credit card purchases. Axes: Month (Jan. to Dec.), Category (Supermarket, Travel, Vehicle, Retail, Restaurant, Miscellaneous), Region (One, Two, Three, Four). Highlighted cell: Month = Dec., Category = Vehicle, Region = Two, Amount = 6,720, Count = 110.]
Why Multidimensional Data
• Queries requiring only a single number to be retrieved need not use multidimensional databases.
• If queries involve retrieving multiple numbers and aggregating them, response times for large databases can become intolerable, as relational databases can scan only a few hundred records per second.
• However, multidimensional databases can add up 10,000 or more numbers in rows and columns per second.
• Thus for such queries multidimensional databases have an enormous performance advantage.
Multi-dimensional Operations
• Slice – A single dimension operation
• Dice – A multidimensional operation
• Roll-up – A higher level of generalization
• Drill-down – A greater level of detail
• Rotation – View data from a new perspective
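These operations can be sketched on a tiny cube held as a Python dictionary keyed by (month, category, region). Only the Dec./Vehicle/Two amount of 6,720 comes from the credit card cube; every other cell value, and all function names, are assumptions made for the illustration.

```python
from collections import defaultdict

# Cube cells keyed by (month, category, region) -> amount.
# Only the Dec./Vehicle/Two value is taken from the slides; the rest is made up.
cube = {
    ("Dec.", "Vehicle", "Two"): 6720,
    ("Dec.", "Vehicle", "One"): 5100,
    ("Dec.", "Retail", "Two"): 4300,
    ("Nov.", "Vehicle", "Two"): 6100,
    ("Nov.", "Retail", "One"): 3900,
}
DIMS = ("month", "category", "region")


def slice_cube(cube, dim, value):
    """Slice: fix a single dimension to one value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, keep):
    """Roll-up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)


def rotate(cube, order):
    """Rotation: view the same cells with the dimensions in a new order."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}


december = slice_cube(cube, "month", "Dec.")
dec_region_two = dice_cube(cube, month={"Dec."}, region={"Two"})
by_month = roll_up(cube, ["month"])
by_region_first = rotate(cube, ("region", "month", "category"))
print(by_month)
```

Drill-down is the reverse of roll-up: moving from `by_month` back to the more detailed cells in `cube`.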
Simple Hierarchies: Roll up
• With hierarchical dimensions the database knows not to combine members of the dimension that are at different levels of the hierarchy; aggregating up the levels is referred to as roll-up.
• It allows the user to view queries at any level, e.g. at street level, city level, state level and region level (refer to the star schema example above).
• Such hierarchies facilitate drill-down to successive levels of detail: state level, city level, street level.
Multiple hierarchies: roll up
• With multiple hierarchies, e.g. product sales can roll up by region, by type, by brand name and so forth. Without this capability an extra dimension would have to be created for each.
• Another use of multiple hierarchies is for geographical dimensions.
Drill down to core database
• Most organisations now utilise relational
databases as standard for their data warehouses.
• Often there is no need to replicate all the data in
the relational database into a MD database for
OLAP.
• Summary level data can be kept in the MD
database and detailed data in the relational
database.
Drilling to relational data
• To get a single number from a MD database takes the same time as it does from a relational database.
• Thus it would be futile to load individual customers into a MD database. But for summarised data a MD database is superior.
• Thus ideally you should be able to drill down through the MD database into the relational database.
• Such an approach is useful as most of the data volume will reside at the detailed level and will thus not hinder queries at the higher levels.
Support for complex calculations
• Important computational features of OLAP servers include:
– Independently dimensioned variables (IDV)
– Statistical calculations
– Consolidation speed
– Vector arithmetic
OLAP calculations: Variables
• Variables are numeric measures (facts) such as sales, cost, price…; dimensions include region, customer type, product…: i.e. fact table and dimension tables.
• OLAP servers can treat variables as a special dimension, so one can select only the relevant dimensions for each variable (IDV).
• OLAP servers must provide a range of powerful computational and statistical methods, such as those required by sales forecasting: regression analysis, projection, correlations…
• They can also incorporate various rules for consolidation.
[Figure: Star schema for property sales of DreamHome]
Vector Arithmetic
• Data held in 2-D arrays (matrices) can be more easily manipulated than data stored in a relational table.
• Thus a 2-D plane for actual can easily be subtracted from a plane for budget to give a plane for variance.
• Such arithmetic allows entire planes of the database to be combined quickly.
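As a small illustration of plane arithmetic, the sketch below subtracts an "actual" plane from a "budget" plane element by element. The figures are made up, and plain nested lists stand in for whatever array structure an OLAP server would really use.

```python
def plane_subtract(minuend, subtrahend):
    """Element-wise difference of two equally shaped 2-D planes."""
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(minuend, subtrahend)
    ]


# Made-up figures: rows are products, columns are regions.
actual = [[120, 95], [80, 110]]
budget = [[100, 100], [90, 100]]

# Variance plane: actual subtracted from budget, cell by cell.
variance = plane_subtract(budget, actual)
print(variance)
```

The whole plane is combined in one operation, with no per-row joins or lookups.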
Time Series Data Types
• Users want to look at trends in all aspects of their business
e.g. sales trends, market trends etc.
• A series of numbers representing a particular variable over time is called a time series, e.g. 52 weekly sales numbers is a time series.
• Utilising a time-series data type allows you to store an
entire string of numbers representing daily, weekly or
monthly data.
• Thus an OLAP server that supports time-series data type
allows one to store historical data without having to
specify a separate dimension for time.
• Unlike other dimensions time has special attributes and
rules.
Time-series data type
• Time series always have a particular periodicity.
• Time series data must include rules to convert one
periodicity to another
• In the absence of a time-series data type a new
dimension must be declared and labelled
explicitly.
• A time-series data cell contains a great deal of
information compared with a single cell or even a
full record.
Time-Series Data types
• Consider the following example for a time-series data type of sales.
• Start date = 1/1/2000
• Periodicity = Daily, business days only
• Conversion = Summation
• Long description = Variable=Sales, Product=Nuts, Region=East
• Data type = Numeric, single precision
• Sparsity = Non-sparse
• Calendar = 445 Fiscal year
• Data points = 708,800,821,743,779,856,878,902,799, ...
Time-series data types
• Start date is the first data point
• Periodicity can be daily, weekly etc with calendar
years, fiscal periods and business weeks etc being
understood.
• Data type can be single precision, double
precision, text strings or dates
• Sparse data is used where the same number is used
over and over again e.g. price. Defining it as
sparse would cause the database to store dates on
which the price changed and the corresponding
new values.
• Data points can store very long time series e.g. 10
years of daily data.
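A time-series value that carries its own periodicity and conversion rule might be sketched as below. The `TimeSeries` class and its `to_weekly` method are hypothetical; the sketch also assumes that the first data point falls on the first business day of a week and that a week has five business days. The data points are the nine values listed in the sales example.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    description: str
    periodicity: str   # e.g. "daily"
    conversion: str    # rule applied when changing periodicity
    points: list = field(default_factory=list)

    def to_weekly(self, days_per_week=5):
        """Convert a business-day series to weekly using the conversion rule."""
        chunks = [self.points[i:i + days_per_week]
                  for i in range(0, len(self.points), days_per_week)]
        if self.conversion == "summation":
            values = [sum(chunk) for chunk in chunks]
        elif self.conversion == "average":
            values = [sum(chunk) / len(chunk) for chunk in chunks]
        else:
            raise ValueError(f"unsupported conversion rule: {self.conversion}")
        return TimeSeries(self.description, "weekly", self.conversion, values)


daily_sales = TimeSeries(
    description="Variable=Sales, Product=Nuts, Region=East",
    periodicity="daily",
    conversion="summation",
    points=[708, 800, 821, 743, 779, 856, 878, 902, 799],
)
weekly_sales = daily_sales.to_weekly()
print(weekly_sales.points)  # [3851, 3435]; the second week is partial (4 days)
```

A variable such as price would more sensibly use an averaging rule, which is why the conversion rule is stored with the series.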
Sparse Data
• When less than 10% of the cells contain data the database is said to be sparsely populated or sparse.
• Sparsity can also occur if there are many cells that contain the same number, e.g. the price of a product every day.
• This situation can also be represented by storing the number once along with the number of days that the number is repeated.
• While a relational database would fill up the database with duplicate data, an OLAP server that understands sparse data can skip over zeros, missing data and duplicate data.
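Storing a repeated number once together with its repeat count is a form of run-length encoding. A minimal sketch with made-up daily prices (the function names are illustrative, not from any OLAP product):

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, repeat_count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, repeat_count] pairs back into the full series."""
    return [v for v, count in runs for _ in range(count)]


# Made-up daily prices: the price changes only twice in nine days.
daily_price = [1.99] * 4 + [2.49] * 3 + [1.99] * 2
encoded = run_length_encode(daily_price)
print(encoded)
```

Nine stored cells shrink to three pairs here, and the saving grows with the length of each run.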
Conclusion
• In essence OLAP technology is a fast, flexible
data summarisation and analysis tool.
• The data analysis requires the ability to summarise
data in many ways and view trends.
• It should have 3 main characteristics: MD views,
ability to perform complex calculations, time
intelligence
Alternative Database topology: The star schema
(a schema that can serve the D.W., OLAP and data mining)
The Star Schema
• Fact Table: Dimension Key 1, Dimension Key 2, Dimension Key 3, Dimension Key 4, Fact 1, Fact 2, Fact 3, Fact 4, ..., Fact n
• Dimension Table 1: Dimension Key 1, Description 1, Aggregatn Lvl 1.1, Aggregatn Lvl 1.2, Aggregatn Lvl 1.n
• Dimension Table 2: Dimension Key 2, Description 2, Aggregatn Lvl 2.1, Aggregatn Lvl 2.2, Aggregatn Lvl 2.n
• Dimension Table 3: Dimension Key 3, Description 3, Aggregatn Lvl 3.1, Aggregatn Lvl 3.2, Aggregatn Lvl 3.n
• Dimension Table 4: Dimension Key 4, Description 4, Aggregatn Lvl 4.1, Aggregatn Lvl 4.2, Aggregatn Lvl 4.n
Dimension Table
• Example layout (Dimension Table 1): Dimension Key 1, Description 1, Aggregatn Lvl 1.1, Aggregatn Lvl 1.2, Aggregatn Lvl 1.n
• Describes the data that has been organized in the Fact Table
• Key should be the most detailed aggregation level necessary (e.g. country vs. county), if possible
• Otherwise surrogate keys may be necessary, but they will decrease the natural value of the key
• Manageable number of aggregation levels
Fact Table
• Example layout: Dimension Key 1, Dimension Key 2, Dimension Key 3, Dimension Key 4, Fact 1, Fact 2, Fact 3, Fact 4, ..., Fact n
• Quantifies the data that has been described by the Dimension Tables
• Key made up of unique combination of values of dimension keys
– ALWAYS contains date or date dimension
• Fact values should be additive
– Aggregations of quantities or amounts from atomic level
– No percentages or ratios
– May be non-additive, time-variant data
Star Schema Query
Select
    E.Month, B.Customer_Type, C.Product_Type,
    D.Store_Location, sum(A.Total_Quantity)
From
    Purchases_1 A, Customer_Type B, Product C,
    Selling_Responsibility D, Date_Information E
Where
    B.Customer_Type = A.Customer_Type
    and C.Product_Code = A.Product_Code
    and D.Sales_Rep_ID = A.Sales_Rep_ID
    and E.Week_Ending_Date = A.Week_Ending_Date
    and E.Year = '1996'
    and C.Product_Category = 'V'
Group by
    E.Month, B.Customer_Type, C.Product_Type,
    D.Store_Location;
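To try the query, the sketch below builds a cut-down version of the example star schema in an in-memory SQLite database using Python's standard `sqlite3` module. Only the columns the query touches are created, and every data value (store location, product codes, dates, quantities) is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer_Type (Customer_Type TEXT PRIMARY KEY, Cust_Type_Desc TEXT);
CREATE TABLE Product (Product_Code TEXT PRIMARY KEY, Product_Type TEXT,
                      Product_Category TEXT);
CREATE TABLE Selling_Responsibility (Sales_Rep_ID INTEGER PRIMARY KEY,
                                     Store_Location TEXT);
CREATE TABLE Date_Information (Week_Ending_Date TEXT PRIMARY KEY, Month TEXT,
                               Year TEXT);
CREATE TABLE Purchases_1 (Sales_Rep_ID INTEGER, Product_Code TEXT,
                          Customer_Type TEXT, Week_Ending_Date TEXT,
                          Total_Quantity INTEGER);

-- Made-up sample rows.
INSERT INTO Customer_Type VALUES ('R', 'Retail');
INSERT INTO Product VALUES ('P1', 'Sedan', 'V'), ('P2', 'Snack', 'F');
INSERT INTO Selling_Responsibility VALUES (1, 'Dublin');
INSERT INTO Date_Information VALUES
    ('1996-01-07', 'Jan', '1996'),
    ('1996-01-14', 'Jan', '1996'),
    ('1997-01-05', 'Jan', '1997');
INSERT INTO Purchases_1 VALUES
    (1, 'P1', 'R', '1996-01-07', 3),
    (1, 'P1', 'R', '1996-01-14', 4),
    (1, 'P2', 'R', '1996-01-07', 9),
    (1, 'P1', 'R', '1997-01-05', 5);
""")

# The star schema query: join the fact table to each dimension, filter, aggregate.
rows = conn.execute("""
SELECT E.Month, B.Customer_Type, C.Product_Type, D.Store_Location,
       SUM(A.Total_Quantity)
FROM Purchases_1 A, Customer_Type B, Product C,
     Selling_Responsibility D, Date_Information E
WHERE B.Customer_Type = A.Customer_Type
  AND C.Product_Code = A.Product_Code
  AND D.Sales_Rep_ID = A.Sales_Rep_ID
  AND E.Week_Ending_Date = A.Week_Ending_Date
  AND E.Year = '1996'
  AND C.Product_Category = 'V'
GROUP BY E.Month, B.Customer_Type, C.Product_Type, D.Store_Location
""").fetchall()
print(rows)
```

Only the two 1996 purchases of the category 'V' product survive the filters, so the result is a single group with a total quantity of 7.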
Answer: Distinct Time Period Fact Tables
[Figure: a Weekly fact table and a Monthly fact table, each keyed by Date, sharing dimension tables D1, D2, D3 and D4]
• Create separate fact tables to account for different time
periods
• Date still part of each fact table key
• Same dimension tables used by both fact tables
• Improves overall performance (loading and accessing)
for each time period
• Will not increase amount of managed redundancy
Question
• Business decisions require the delivery of
critical information in a timely, suitable
format. Explain, using appropriate
examples, how OLAP can facilitate the
business decision making process.
Question
• A data warehouse, a data mining system and OLAP are 3 important technologies used in facilitating business decision making.
– The star schema is a database schema that can be utilised by all three technologies. Describe, using a simple example, the essential elements of this schema.
– (10 marks)
– Explain, using suitable examples, how any two of the technologies could be used to provide information to formulate or derive simple business strategies.
– (20 marks)