CS 404 Data Mining & Knowledge Discovery -- FS01-L1

CS/EngMt/CpEng 404
Data Mining & Knowledge Discovery
Dan St. Clair
Lect 1 – Intro. to Data Mining & Data Warehouses
Information Age Produces Large Amounts of Data
• Data collected on almost everything
• WWW rich data resource
• Data warehouses required to hold data
© 2002 by D. C. St. Clair
CS 404 Data Mining & Knowledge Discovery
The problem:
How do we turn information into useful knowledge?
Solution:
Data mining & knowledge discovery
Data Mining & Knowledge Discovery
This class provides
• Tools & techniques for producing useful knowledge from information
• Experience in using these tools
Data Mining & Knowledge Discovery in CS 404
• We will study
– Data warehouses
– Classification & Association rule miners (C4.5)
– Neural networks (BP, SOM)
– Classical tools
  • Correlation
  • Regression
  • Clustering
• We will do several projects requiring mining knowledge from “real” data
CS 404 Class Information
Prerequisites:
CS 347 (Artificial Intelligence) or CS 304 (Database Systems), and Stat 215
Texts:
• Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
• Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
CS 404 Class Information
Reference:
(This or a similar Matlab reference is recommended.)
Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.
Software:
• C4.5 – provided to class w/o charge
• Matlab – can be purchased from Mathworks or used by logging in to UMR
• Microsoft Excel (provided on UMR CLC computers)
CS 404 Class Information (Cont'd)
Instructor:
D.C. St. Clair, Ph.D.
325 Computer Science
Phone: (573) 341-6352
Fax: (573) 341-4501
e-mail: stclair@umr.edu
Class web page:
www.umr.edu/~stclair or
http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/
Things you will find on the class web page:
• Syllabus
• Schedule
• Homework assignments
• Lecture notes
Who am I?
• Professor and Chair, UMR Computer Science Dept.
• Research area -- Data mining, machine intelligence, neural networks
– diagnostics
– intelligent graphics
– data mining
– pattern recognition & analysis
– system monitoring & assessment
• “Applied” experience
– Union Pacific Technologies Intelligent Systems Advisor
– Visiting Principal Scientist, McDonnell Douglas Research Laboratories
– NASA’s Johnson Space Center
– Defense: Navy, Army, and Air Force
– Co-founder & former Chief Scientist of intelligent software systems company
Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at:
www.cs.sfu.ca/~han/DM_Book.html
which contains several interesting things including:
1. A list of errata for the data mining book
2. A set of slides he uses in the data mining course he teaches.
[I will be using some of these slides in my lectures.]
You may want to check these out.
Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery
• Intro. to CS 404 (We just finished this.)
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema
Data -- Information -- Knowledge
The set of values:
12345  67890  1000.00  2846.92  SA  CK
has no meaning. It is data but it is NOT information.
Information: Information is the result of organizing data into meaningful quantities.
The following relational table helps turn the data into information since it associates meaning with the data:
Account Number | Balance | Type
12345 | 1000.00 | SA
67890 | 2846.92 | CK
A database is a “structured” collection of data stored and operated on within a management environment known as a Database Management System (DBMS) or database system. The DBMS helps to transform data into information.
Knowledge can be created from information.
What Is Data Mining?
How Does It Differ From Existing Database Technologies?
Data Sources: Databases, data warehouses, Internet
Decision Support Systems
Tools for asking questions & doing analyses when you know what you want to ask and where you are going. (Ex. OLAP tools)
Data Mining
Process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.
Other Names Used in Conjunction With Data Mining
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• What is not data mining
– (Deductive) query processing
– Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Data Mining Example
Potential-Customer*
Person | Age | Sex | Income | Customer
Ann Smith | 32 | F | 10,000 | yes
Joan Gray | 53 | F | 1,000,000 | yes
Mary Blythe | 27 | F | 20,000 | no
Jane Brown | 55 | F | 20,000 | yes
Bob Smith | 30 | M | 100,000 | yes
Jack Brown | 50 | M | 200,000 | yes
Married-To
Husband | Wife
Bob Smith | Ann Smith
Jack Brown | Jane Brown
Knowledge Within A Relation
IF Income(Person) ≥ 100,000 THEN Potential-Customer(Person)
IF Sex(Person) = F AND Age(Person) ≥ 32 THEN Potential-Customer(Person)
Knowledge From Multiple Relations
IF Married-To(Person,Spouse) AND Income(Person) ≥ 100000 THEN Potential-Customer(Spouse)
IF Married-To(Person,Spouse) AND Potential-Customer(Person) THEN Potential-Customer(Spouse).
* Dzeroski, Saso, Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, Ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy, AAAI Press, 1996, pp. 117-152.
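The rules on this slide can be checked mechanically against the two relations. Below is a minimal Python sketch. It assumes the comparison operator lost from the slide text is >= (the reading that agrees with the Customer column), and it treats Married-To as stored, with Person as the husband and Spouse as the wife. The names `people`, `married_to`, and the rule functions are illustrative only and do not come from the source.

```python
# Potential-Customer relation from the slide: name -> (age, sex, income, customer)
people = {
    "Ann Smith":   (32, "F", 10_000, True),
    "Joan Gray":   (53, "F", 1_000_000, True),
    "Mary Blythe": (27, "F", 20_000, False),
    "Jane Brown":  (55, "F", 20_000, True),
    "Bob Smith":   (30, "M", 100_000, True),
    "Jack Brown":  (50, "M", 200_000, True),
}

# Married-To relation from the slide: (husband, wife) pairs
married_to = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]


def rule_income(name):
    """IF Income(Person) >= 100,000 THEN Potential-Customer(Person). (>= is assumed.)"""
    return people[name][2] >= 100_000


def rule_sex_age(name):
    """IF Sex(Person) = F AND Age(Person) >= 32 THEN Potential-Customer(Person)."""
    age, sex, _, _ = people[name]
    return sex == "F" and age >= 32


def rule_spouse_income(name):
    """IF Married-To(Person, Spouse) AND Income(Person) >= 100000
    THEN Potential-Customer(Spouse); `name` plays the role of Spouse."""
    return any(people[h][2] >= 100_000 for h, w in married_to if w == name)


# Single-relation rules: predicted customers vs. the Customer column
predicted = {n for n in people if rule_income(n) or rule_sex_age(n)}
actual = {n for n, row in people.items() if row[3]}

# Multi-relation rule: spouses predicted through the Married-To join
predicted_spouses = {n for n in people if rule_spouse_income(n)}

print(sorted(predicted))
print(predicted == actual)
print(sorted(predicted_spouses))
```

Under these assumptions the two single-relation rules together cover exactly the people marked "yes", and the multi-relation rule reaches Ann Smith and Jane Brown through their husbands' incomes, even though neither has a high income herself.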
Simple Concept Learning -- Example
“Routine”, “well-understood” chemistry experiment performed numerous times.
• Expected result occurred about half the time
• Unexpected result occurred remainder of the time
Numerous repetitions of experiment produced similar results
Careful analysis determined:
• One result produced when setup was in sunlight
• Second result produced when setup was in shade
Careful investigation showed:
Experiment sensitive to ultraviolet radiation
Result:
Patented method for determining presence of ultraviolet radiation
The Knowledge Discovery Process
[Figure: Data Sources → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns / Models → Interpretation/Evaluation → Knowledge]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Data Sources
• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
:
:

Relational Databases
Multidimensional Data Cube
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
Data Mining Tasks
• Predictive
– Perform inference on current data
• Descriptive (KDD)
– Characterize general properties of data
Notes:
– A measure of certainty or “belief” must be associated with each pattern
– “Interesting” patterns must be identified
Kinds of Data Patterns to Be “Mined”
• Concept/class description
• Association analyses
• Classification & prediction
• Cluster analysis
• Outlier analysis
Concept/class Descriptions
Example 1
Produce a description summarizing characteristics of customers who purchase diapers
• Objective: produce a description of those in the target class
• Characterizes class/concept
Example 2
What properties distinguish diaper buyers from other store customers?
• Discriminates class/concept
• Leads to other questions
– What else do they buy?
– When do they purchase these items?
Association Analysis
Assoc. Anal. -- discovery of association relationships between attribute-value conditions.
Such relationships may be expressed in many ways.
One common way is through association rules.
X => Y
A1 ^ ... ^ Am => B1 ^ ... ^ Bn
Association Rules
Example
age (X, “20 .. 29”) ^ income (X, “20K..29K”) => buys (X, “CD changer”)
[support = 2%, confidence = 60%]
• support: % of data instances satisfying all three components of the rule
• confidence: of the data instances that satisfy the hypothesis, the % for which the conclusion is predicted correctly
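Support and confidence for a rule X => Y can be computed by simple counting. Below is a minimal Python sketch. The records are invented for illustration and are not the data behind the 2% and 60% figures on the slide; the helper name `support_confidence` is also an assumption.

```python
def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    records    : list of dicts mapping attribute -> value
    antecedent : dict of attribute-value conditions (left-hand side)
    consequent : dict of attribute-value conditions (right-hand side)
    """
    def satisfies(record, conditions):
        return all(record.get(attr) == val for attr, val in conditions.items())

    n_total = len(records)
    n_lhs = sum(1 for r in records if satisfies(r, antecedent))
    n_both = sum(1 for r in records
                 if satisfies(r, antecedent) and satisfies(r, consequent))

    # support: fraction of all records satisfying both sides of the rule
    support = n_both / n_total if n_total else 0.0
    # confidence: fraction of records satisfying the left side that also satisfy the right
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical records, invented only to exercise the function
records = [
    {"age": "20..29", "income": "20K..29K", "buys": "CD changer"},
    {"age": "20..29", "income": "20K..29K", "buys": "CD changer"},
    {"age": "20..29", "income": "20K..29K", "buys": "TV"},
    {"age": "30..39", "income": "20K..29K", "buys": "CD changer"},
    {"age": "30..39", "income": "40K..49K", "buys": "TV"},
]

s, c = support_confidence(
    records,
    antecedent={"age": "20..29", "income": "20K..29K"},
    consequent={"buys": "CD changer"},
)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

In this toy data, 2 of the 5 records satisfy the whole rule (support 2/5), and 2 of the 3 records that satisfy the left-hand side also satisfy the right-hand side (confidence 2/3).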
Classification & Prediction
[Figure: scatter plot of loan data points marked x and o; axes: Debt vs. Income]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Classification (nonlinear)
[Figure: the same Debt vs. Income scatter plot of x and o points, divided by a nonlinear boundary into “No Loan” and “Loan” regions]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
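Many classifiers can produce a nonlinear decision boundary of this kind. As one illustration only (not the method used to draw the figure), here is a nearest-neighbor sketch in Python. The (income, debt) points, their labels, and the function name `classify` are invented for the example.

```python
import math

# Hypothetical training points: ((income, debt), label); all values are invented
training = [
    ((20, 60), "no loan"), ((25, 80), "no loan"), ((35, 90), "no loan"),
    ((30, 40), "no loan"),
    ((60, 30), "loan"), ((70, 50), "loan"), ((80, 20), "loan"),
    ((55, 10), "loan"),
]


def classify(point, examples=training):
    """Return the label of the training example closest to `point` (1-nearest-neighbor)."""
    nearest = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return nearest[1]


print(classify((75, 35)))   # lies among the high-income, low-debt examples
print(classify((22, 70)))   # lies among the low-income, high-debt examples
```

Because each new point takes the label of its closest stored example, the implied boundary between the two classes follows the training data and need not be a straight line.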
Cluster Analysis
[Figure: scatter plot of unlabeled data points marked +; axes: Debt vs. Income]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
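Cluster analysis groups unlabeled points by similarity. A minimal k-means sketch in Python follows. k-means is chosen here only as one common clustering method; the slide does not name a specific algorithm, and the points and starting centers are invented.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(center)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters


# Hypothetical (income, debt) points forming three loose groups
points = [(10, 80), (12, 85), (15, 78),
          (50, 50), (52, 55), (48, 47),
          (90, 10), (88, 15), (92, 12)]

centers, clusters = kmeans(points, centers=[(0, 100), (50, 50), (100, 0)])
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

The starting centers are fixed here so the run is repeatable; in practice they are usually chosen at random, and different starts can give different clusterings.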
Some Major Data Mining Issues
• Mining methodologies
• User interaction
• Performance (accuracy, robustness)
• Heterogeneous databases
• Interestingness
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• From data warehousing to data mining
What Is a Data Warehouse?
DWs provide architectures and tools to support the systematic
– organization,
– understanding, and
– use of data.
Note: DWs may consist of data from numerous sources, including business, scientific, and engineering data.
Features of a Data Warehouse
• Subject-oriented -- organized around major subjects
• Integrated -- integrates multiple heterogeneous data sources
– Relational databases
– Flat files
– On-line transaction records
• Consistency is enforced
• Time-variant -- data stored to provide historical data
• Nonvolatile
– Physically separate from operational environment
– Operations on data: initial loading & retrieval
OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Multidimensional Data Models
Figure 2.1 3-D data cube, AllElectronics sales data
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
4-D Data Cube of AllElectronics Sales Data
Figure 2.2 4-D data cube, AllElectronics sales data
Fig. 2.3 A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
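The lattice is simply every subset of the dimension set, grouped by size. A short Python sketch that enumerates it for the four dimensions in the figure (the variable names are illustrative):

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group every subset of the dimensions by its size (0-D apex up to 4-D base)
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")

# With n dimensions and no concept hierarchies there are 2**n cuboids
total = sum(len(cuboids) for cuboids in lattice.values())
print(total)
```

For these four dimensions the levels contain 1, 4, 6, 4, and 1 cuboids, 16 in all, which is why materializing every cuboid becomes expensive as dimensions are added.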
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig. 2.4 Example of Star Schema
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
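A star schema maps directly onto relational tables: one fact table holding foreign keys and measures, plus one table per dimension. The sketch below uses Python's built-in sqlite3 module. It is trimmed to two of the four dimensions and two of the three measures, and every row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two of the four dimension tables plus the fact table (trimmed for brevity)
cur.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER, location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical rows, invented for illustration
cur.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "CD changer", "BrandA", "audio", "wholesale"),
    (2, "TV", "BrandB", "video", "retail"),
])
cur.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (10, "1 Main St", "Rolla", "MO", "USA"),
    (11, "2 Oak Ave", "Vancouver", "BC", "Canada"),
])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (100, 1, 5, 10, 3, 600.0),
    (100, 2, 5, 10, 1, 400.0),
    (101, 1, 6, 11, 2, 400.0),
])

# A typical star join: the fact table joined to its dimensions, then aggregated
rows = cur.execute("""
    SELECT l.country, i.type, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item i ON f.item_key = i.item_key
    JOIN location l ON f.location_key = l.location_key
    GROUP BY l.country, i.type
    ORDER BY l.country, i.type
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The query touches the large fact table once and uses the small dimension tables only to label and group the results, which is the access pattern the star layout is meant to keep simple.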
Fig. 2.5 Example of Snowflake Schema
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city_key
– city: city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Fig. 2.6 Example of Fact Constellation
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
– Keys: time_key, item_key, shipper_key, from_location, to_location
– Measures: dollars_cost, units_shipped
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_state, country
– shipper: shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
– First time as “cube definition”
– define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
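To make the three measures in the cube definition concrete, here is a small Python sketch that computes them for each cell of a 4-D cube. It is not DMQL and not an implementation of it. The sales records are invented, and each record is assumed to represent one sale with a `sales_in_dollars` amount.

```python
from collections import defaultdict

# Hypothetical sales records: (time, item, branch, location, sales_in_dollars)
sales = [
    ("Q1", "TV", "B1", "Rolla", 400.0),
    ("Q1", "TV", "B1", "Rolla", 500.0),
    ("Q1", "CD changer", "B1", "Rolla", 200.0),
    ("Q2", "TV", "B2", "St. Louis", 450.0),
]

# Collect the sales amounts that fall into each cell of the 4-D cube
cells = defaultdict(list)
for time, item, branch, location, dollars in sales:
    cells[(time, item, branch, location)].append(dollars)

# Compute the three measures named in the cube definition for every cell
cube = {
    key: {
        "dollars_sold": sum(amounts),               # sum(sales_in_dollars)
        "avg_sales": sum(amounts) / len(amounts),   # avg(sales_in_dollars)
        "units_sold": len(amounts),                 # count(*)
    }
    for key, amounts in cells.items()
}

for key, measures in cube.items():
    print(key, measures)
```

Each key of `cube` is one combination of the four dimension values, matching the dimension list in the `define cube` statement, and each value holds the aggregated measures for that cell.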
University of Missouri-Rolla
Copyright 2001 Curators of University of Missouri