Uploaded by Dhirendra Thapa

datamining

Data Warehousing and Data Mining in e‐Government
Inmon’s definition
A data warehouse is a
‐subject‐oriented,
‐integrated,
‐time‐variant,
‐nonvolatile
collection of data in support of management’s
decision making process.
Subject‐oriented
 A data warehouse is organized around subjects
such as sales, product, customer.
 It focuses on modeling and analysis of data for
decision makers.
 It excludes data not useful in the decision support
process.
Integration
 A data warehouse is constructed by integrating
multiple heterogeneous sources.
 Data preprocessing is applied to ensure
consistency.
[Figure: RDBMS, legacy system and flat-file sources pass through data processing and data transformation into the data warehouse.]
Time‐variant
 Provides information from a historical
perspective, e.g. the past 5‐10 years
 Every key structure contains, either implicitly or
explicitly, an element of time
Nonvolatile
 Data once recorded cannot be updated.
 A data warehouse requires only two operations in data
accessing:
   Initial loading of data
   Access of data
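The load-and-access restriction above can be illustrated with a small sketch. The class name `NonvolatileStore` and its methods are hypothetical, chosen only for illustration; the rows reuse values from the case study's fact table.

```python
class NonvolatileStore:
    """Toy store exposing only the two warehouse operations: load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends immutable copies; rows already stored are never changed.
        self._rows.extend(tuple(r) for r in rows)

    def access(self, predicate=lambda row: True):
        # Access is read-only: callers receive a new list of immutable tuples.
        return [r for r in self._rows if predicate(r)]


store = NonvolatileStore()
store.load([("Mumbai", "Wheat Bread", "January", 3)])
store.load([("Pune", "Cheese", "January", 4)])
pune_rows = store.access(lambda r: r[0] == "Pune")
```

There is deliberately no update or delete method, which is the point of the sketch: new data arrives only through further loads.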
Operational v/s Information System
Features                     Operational                           Information
Characteristics              Operational processing                Informational processing
Orientation                  Transaction                           Analysis
User                         Clerk, DBA, database professional     Knowledge workers
Function                     Day to day operation                  Decision support
Data                         Current                               Historical
View                         Detailed, flat relational             Summarized, multidimensional
DB design                    Application oriented                  Subject oriented
Unit of work                 Short, simple transaction             Complex query
Access                       Read/write                            Mostly read
Focus                        Data in                               Information out
Number of records accessed   tens                                  millions
Number of users              thousands                             hundreds
DB size                      100 MB to GB                          100 GB to TB
Priority                     High performance, high availability   High flexibility, end‐user autonomy
Metric                       Transaction throughput                Query throughput
Data Warehousing Architecture
[Figure: DATA SOURCES (operational DBs, external sources) are extracted, transformed, loaded and refreshed into the data warehouse (reconciled data) and DATA MARTS, supported by a metadata repository and monitoring & administration; OLAP servers serve the TOOLS for analysis, query/reporting and data mining.]
Data Warehouse Architecture
 Data Warehouse server
   almost always a relational DBMS, rarely flat files
 OLAP servers
   to support and operate on multi‐dimensional data
    structures
 Clients
   Query and reporting tools
   Analysis tools
   Data mining tools
Building Data Warehouse
 Data Selection
 Data Preprocessing
   Fill missing values
   Remove inconsistency
 Data Transformation & Integration
 Data Loading
Data in the warehouse is stored in the form of fact tables
and dimension tables.
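The preprocessing step (fill missing values, remove inconsistency) can be sketched as below. This is a minimal illustration, not a prescribed method: the raw rows, the `preprocess` function and the choice of 0 as the fill value are all assumptions made for the example.

```python
# Hypothetical raw rows as they might arrive from heterogeneous sources.
raw_rows = [
    {"city": "Mumbai", "product": "Wheat Bread", "units": 3},
    {"city": "mumbai ", "product": "Cheese", "units": None},  # inconsistent name, missing value
    {"city": "PUNE", "product": "Cheese", "units": 4},
]


def preprocess(rows, default_units=0):
    """Fill missing unit counts and normalise inconsistent spellings."""
    cleaned = []
    for row in rows:
        cleaned.append({
            # Remove inconsistency: trim whitespace and use one capitalisation.
            "city": row["city"].strip().title(),
            "product": row["product"].strip().title(),
            # Fill missing values with an assumed default.
            "units": default_units if row["units"] is None else row["units"],
        })
    return cleaned


clean_rows = preprocess(raw_rows)
```

After this step every row uses the same city spelling and has a numeric unit count, so the rows can be transformed and loaded consistently.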
Case Study
 Afco Foods & Beverages is a new company
which produces dairy, bread and meat products,
with its production unit located at Baroda.
 Their products are sold in the North, North West
and Western regions of India.
 They have sales units at Mumbai, Pune,
Ahmedabad, Delhi and Baroda.
 The President of the company wants sales
information.
Sales Information
Report: The number of units sold.
113
Report: The number of units sold over time
January   February   March   April
14        41         33      25
Sales Information
Report: The number of items sold for each product with
time
Product        Jan   Feb   Mar   Apr
Wheat Bread    6     17    8
Cheese         6     16    6
Swiss Rolls    8     25    21
Sales Information
Report: The number of items sold in each City for each product with time
City     Product        Jan   Feb   Mar   Apr
Mumbai   Wheat Bread    3     10    7     8
         Cheese         3     16    6
         Swiss Rolls    4     16    6
Pune     Wheat Bread    3
         Cheese         3
         Swiss Rolls    4     9     15
Sales Information
Report: The number of items sold and income in each region for
each product with time.
(Rs = income in rupees, U = units sold)
                        Jan          Feb          Mar          Apr
City     Product        Rs     U     Rs     U     Rs     U     Rs     U
Mumbai   Wheat Bread    7.44   3     24.80  10    17.36  7     21.20  8
         Cheese         7.95   3     42.40  16    15.90  6
         Swiss Rolls    7.32   4     29.98  16    10.98  6
Pune     Wheat Bread    7.44   3
         Cheese         7.95   3
         Swiss Rolls    7.32   4     16.47  9     27.45  15
Sales Measures & Dimensions
 Measure – Units sold, Amount.
 Dimensions – Product, Time, Region.
Sales Data Warehouse Model
Fact Table
City     Product       Month      Units   Rupees
Mumbai   Wheat Bread   January    3       7.95
Mumbai   Cheese        January    4       7.32
Pune     Wheat Bread   January    3       7.95
Pune     Cheese        January    4       7.32
Mumbai   Swiss Rolls   February   16      42.40
Sales Data Warehouse Model
City_ID   Prod_ID   Month      Units   Rupees
1         589       1/1/1998   3       7.95
1         1218      1/1/1998   4       7.32
2         589       1/1/1998   3       7.95
2         1218      1/1/1998   4       7.32
1         589       2/1/1998   16      42.40
Sales Data Warehouse Model
Product Dimension Tables
Prod_ID   Product_Name      Product_Category_ID
589       Wheat Bread       1
590       White Bread       1
288       Coconut Cookies   2

Product_Category_Id   Product_Category
1                     Bread
2                     Cookies
Sales Data Warehouse Model
Region Dimension Table
City_ID   City     Region      Country
1         Mumbai   West        India
2         Pune     NorthWest   India
Sales Data Warehouse Model
[Figure: the Sales Fact table at the centre, linked to the Time, Region and Product dimensions; Product is further linked to Product Category.]
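As a rough sketch of how the fact and dimension tables fit together, the model can be loaded into an in-memory SQLite database and queried with joins. The table and column names are lower-cased versions of the ones in the slides, and `sales_fact` is an assumed table name. Only fact rows with Prod_ID 589 are loaded, because Prod_ID 1218 has no matching row in the product dimension shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_category (product_category_id INTEGER PRIMARY KEY, product_category TEXT);
CREATE TABLE product (prod_id INTEGER PRIMARY KEY, product_name TEXT,
                      product_category_id INTEGER REFERENCES product_category);
CREATE TABLE region (city_id INTEGER PRIMARY KEY, city TEXT, region TEXT, country TEXT);
CREATE TABLE sales_fact (city_id INTEGER REFERENCES region,
                         prod_id INTEGER REFERENCES product,
                         month TEXT, units INTEGER, rupees REAL);
""")
conn.executemany("INSERT INTO product_category VALUES (?, ?)",
                 [(1, "Bread"), (2, "Cookies")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(589, "Wheat Bread", 1), (590, "White Bread", 1), (288, "Coconut Cookies", 2)])
conn.executemany("INSERT INTO region VALUES (?, ?, ?, ?)",
                 [(1, "Mumbai", "West", "India"), (2, "Pune", "NorthWest", "India")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 589, "1/1/1998", 3, 7.95),
                  (2, 589, "1/1/1998", 3, 7.95),
                  (1, 589, "2/1/1998", 16, 42.40)])

# Join the fact table to its dimensions and total the units measure
# per region and product category.
rows = conn.execute("""
SELECT r.region, c.product_category, SUM(f.units)
FROM sales_fact f
JOIN region r ON r.city_id = f.city_id
JOIN product p ON p.prod_id = f.prod_id
JOIN product_category c ON c.product_category_id = p.product_category_id
GROUP BY r.region, c.product_category
ORDER BY r.region
""").fetchall()
```

The fact table holds only keys and measures; descriptive attributes such as region and category are reached through the joins.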
Online Analytical Processing (OLAP)
 It enables analysts, managers and executives to gain
insight into data through fast, consistent, interactive
access to a wide variety of possible views of information
that has been transformed from raw data to reflect the
real dimensionality of the enterprise as understood by
the user.
[Figure: a data cube built from the data warehouse, with Product and Time axes.]
OLAP Cube
City     Product       Time    Units   Dollars
All      All           All     113     251.26
Mumbai   All           All     64      146.07
Mumbai   White Bread   All     38      98.49
Mumbai   Wheat Bread   All     13      32.24
Mumbai   Wheat Bread   Qtr1    3       7.44
Mumbai   Wheat Bread   March   3       7.44
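The "All" rows of a cube are totals over every value of a dimension. A minimal sketch of how such a cube could be computed is given below. It uses only the five sample rows of the fact table, so its totals differ from the full figures in the cube above; the function name `build_cube` is hypothetical.

```python
from itertools import product as cartesian

# (city, product, month, units): the five sample rows of the fact table.
facts = [
    ("Mumbai", "Wheat Bread", "January", 3),
    ("Mumbai", "Cheese", "January", 4),
    ("Pune", "Wheat Bread", "January", 3),
    ("Pune", "Cheese", "January", 4),
    ("Mumbai", "Swiss Rolls", "February", 16),
]


def build_cube(rows):
    """Sum units for every combination of concrete values and 'All'."""
    cube = {}
    for *dims, units in rows:
        # Each dimension either keeps its value or is rolled up to "All".
        for mask in cartesian((False, True), repeat=len(dims)):
            key = tuple("All" if rolled else value
                        for value, rolled in zip(dims, mask))
            cube[key] = cube.get(key, 0) + units
    return cube


cube = build_cube(facts)
total_units = cube[("All", "All", "All")]
mumbai_units = cube[("Mumbai", "All", "All")]
```

Precomputing these aggregates is what lets an OLAP tool answer summary queries without rescanning the fact table each time.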
OLAP Operations
Drill Down
Product hierarchy, from summary to detail, viewed against Time:
Category (e.g. Electrical Appliance) → Sub Category (e.g. Kitchen) → Product (e.g. Toaster)
OLAP Operations
Drill Up
Product hierarchy, from detail to summary, viewed against Time:
Product (e.g. Toaster) → Sub Category (e.g. Kitchen) → Category (e.g. Electrical Appliance)
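Drill down and drill up move along a dimension hierarchy. The sketch below uses the City → Region → Country hierarchy from the region dimension table rather than the product hierarchy in the figure, because the case study supplies data for it. The unit totals per city come from the five sample fact rows, and the function names are hypothetical.

```python
from collections import defaultdict

# City -> (Region, Country), taken from the region dimension table.
REGION_DIM = {"Mumbai": ("West", "India"), "Pune": ("NorthWest", "India")}
LEVELS = ("country", "region", "city")  # coarsest to finest

# Unit totals per city from the five sample fact rows.
units_by_city = {"Mumbai": 23, "Pune": 7}


def view_at(level):
    """Aggregate units at the requested level of the hierarchy."""
    totals = defaultdict(int)
    for city, units in units_by_city.items():
        region, country = REGION_DIM[city]
        key = {"city": city, "region": region, "country": country}[level]
        totals[key] += units
    return dict(totals)


def drill_down(level):
    """Move one level finer, e.g. region -> city (stops at the finest level)."""
    return LEVELS[min(LEVELS.index(level) + 1, len(LEVELS) - 1)]


def drill_up(level):
    """Move one level coarser, e.g. city -> region (stops at the coarsest level)."""
    return LEVELS[max(LEVELS.index(level) - 1, 0)]


country_view = view_at("country")
region_view = view_at(drill_down("country"))
```

The same data is shown at each level; only the grouping changes.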
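OLAP Operations
Slice and Dice
[Figure: the Product and Time view is restricted to Product = Toaster, leaving a view along Time.]
OLAP Operations
Pivot
[Figure: the axes of the view are rotated; the axis labels shown are Product, Time and Region.]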
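Slice, dice and pivot can be sketched on the sample fact rows as plain filters and a cross-tabulation. This is an illustration under assumptions: the function names are hypothetical, and the case study's products stand in for the Toaster example in the figure.

```python
CITY, PRODUCT, MONTH, UNITS = 0, 1, 2, 3

facts = [
    ("Mumbai", "Wheat Bread", "January", 3),
    ("Mumbai", "Cheese", "January", 4),
    ("Pune", "Wheat Bread", "January", 3),
    ("Pune", "Cheese", "January", 4),
    ("Mumbai", "Swiss Rolls", "February", 16),
]


def slice_by_product(rows, product):
    """Slice: fix a single value on one dimension."""
    return [r for r in rows if r[PRODUCT] == product]


def dice(rows, cities, months):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return [r for r in rows if r[CITY] in cities and r[MONTH] in months]


def pivot(rows, row_dim, col_dim):
    """Pivot: cross-tabulate units with row_dim down the side and col_dim across."""
    table = {}
    for r in rows:
        cells = table.setdefault(r[row_dim], {})
        cells[r[col_dim]] = cells.get(r[col_dim], 0) + r[UNITS]
    return table


wheat = slice_by_product(facts, "Wheat Bread")
mumbai_january = dice(facts, {"Mumbai"}, {"January"})
by_product = pivot(facts, PRODUCT, MONTH)  # products as rows, months as columns
by_month = pivot(facts, MONTH, PRODUCT)    # the same data with the axes rotated
```

Swapping the two arguments of `pivot` rotates the view without changing any of the underlying numbers.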
OLAP Server
 An OLAP server is a high capacity, multi user
data manipulation engine specifically designed
to support and operate on multi‐dimensional
data structures.
 OLAP servers available are
   MOLAP server
   ROLAP server
   HOLAP server
Presentation
[Figure: a reporting tool turns the Product and Time view into a report.]
Data Warehousing includes
 Build Data Warehouse
 Online analytical processing (OLAP).
 Presentation.
[Figure: RDBMS and flat-file sources → Cleaning, Selection & Integration → Warehouse & OLAP server → Presentation → Client]
Need for Data Warehousing
 Industry has a huge amount of operational data.
 Knowledge workers want to turn this data into
useful information.
 This information is used by them to support
strategic decision making.
 It is a platform for consolidated historical data
for analysis.
 It stores data of good quality so that knowledge
workers can make correct decisions.
 From a business perspective
‐it is the latest marketing weapon
‐it helps to keep customers by learning more
about their needs
‐it is a valuable tool in today’s competitive, fast
evolving world.
Data Warehousing Tools
 Data Warehouse
   SQL Server 2000 DTS
   Oracle 8i Warehouse Builder
 OLAP tools
   SQL Server Analysis Services
   Oracle Express Server
 Reporting tools
   MS Excel Pivot Chart
   VB Applications
Purpose of the DW
 Make information accessible
 Make information consistent
What Is Data Mining?
 Data mining (knowledge discovery from data)
   Extraction of interesting (non‐trivial, implicit, previously
    unknown and potentially useful) patterns or knowledge from
    huge amounts of data
 Alternative names
   Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data archeology,
    data dredging, information harvesting, business intelligence,
    etc.
 Watch out: Is everything “data mining”?
   Simple search and query processing
   (Deductive) expert systems
Knowledge Discovery (KDD) Process
 Data mining: core of the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
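The stages of the KDD process can be read as a chain of small functions, each feeding the next. The sketch below is only a toy illustration: the records are hypothetical, the "mining" step is a simple frequency count, and the minimum count of 2 used for pattern evaluation is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical transaction records; None marks a missing value.
records = [
    {"city": "Mumbai", "product": "Cheese"},
    {"city": "Mumbai", "product": "Wheat Bread"},
    {"city": "Pune", "product": "Cheese"},
    {"city": "Pune", "product": None},
]


def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def select(rows, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in rows]


def mine(values):
    """Data mining (toy version): count how often each value occurs."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that are frequent enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}


interesting = evaluate(mine(select(clean(records), "product")), min_count=2)
```

Data mining is one step in the chain; the quality of its output depends on the cleaning and selection that come before it and the evaluation that comes after.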
Data Mining and Business Intelligence
Layers listed from highest to lowest potential to support business decisions, with the typical role in parentheses:
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the centre, drawing on Database Technology, Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization and Other Disciplines.]
Why Not Traditional Data Analysis?
 Tremendous amount of data
   Algorithms must be highly scalable to handle data on the scale of tera‐bytes
 High‐dimensionality of data
   Micro‐array may have tens of thousands of dimensions
 High complexity of data
   Data streams and sensor data
   Time‐series data, temporal data, sequence data
   Structure data, graphs, social networks and multi‐linked data
   Heterogeneous databases and legacy databases
   Spatial, spatiotemporal, multimedia, text and Web data
   Software programs, scientific simulations
 New and sophisticated applications
Multi‐Dimensional View of Data Mining
 Data to be mined
   Relational, data warehouse, transactional, stream, object‐oriented/relational,
    active, spatial, time‐series, text, multi‐media, heterogeneous, legacy, WWW
 Knowledge to be mined
   Characterization, discrimination, association, classification,
    clustering, trend/deviation, outlier analysis, etc.
   Multiple/integrated functions and mining at multiple levels
 Techniques utilized
   Database‐oriented, data warehouse (OLAP), machine learning,
    statistics, visualization, etc.
 Applications adapted
   Retail, telecommunication, banking, fraud analysis, bio‐data mining,
    stock market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
 General functionality
   Descriptive data mining
   Predictive data mining
 Different views lead to different classifications
   Data view: Kinds of data to be mined
   Knowledge view: Kinds of knowledge to be discovered
   Method view: Kinds of techniques utilized
   Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
 Database‐oriented data sets and applications
   Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
   Data streams and sensor data
   Time‐series data, temporal data, sequence data (incl. bio‐sequences)
   Structure data, graphs, social networks and multi‐linked data
   Object‐relational databases
   Heterogeneous databases and legacy databases
   Multimedia database
   Text databases
   The World‐Wide Web
Major Issues in Data Mining
 Mining methodology
   Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
    Web
   Performance: efficiency, effectiveness, and scalability
   Pattern evaluation: the interestingness problem
   Incorporation of background knowledge
   Handling noise and incomplete data
   Parallel, distributed and incremental mining methods
   Integration of the discovered knowledge with existing one: knowledge fusion
 User interaction
   Data mining query languages and ad‐hoc mining
   Expression and visualization of data mining results
   Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
   Domain‐specific data mining & invisible data mining
   Protection of data security, integrity, and privacy
Summary
 Data mining: Discovering interesting patterns from large amounts of
data
 A natural evolution of database technology, in great demand, with wide
applications
 Includes data cleaning, data integration, data selection, transformation,
data mining, pattern evaluation, and knowledge presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining
Data warehousing and data mining
in government
 Data warehousing and data mining technologies have
extensive potential applications in government
 Such as agriculture, rural development, health and
energy, and national activities of government
National Data warehouses
 Census data
‐A data warehouse can be built from this database, upon which
OLAP techniques can be applied.
‐Data mining can also be performed for analysis and
knowledge discovery.
 Prices of essential commodities
Applications of Data Warehousing
and data mining in e‐Government
 Agriculture
 Rural development
 Health
 Planning
 Education
 Commerce and trade
 Tourism