Decision Tree Construction

advertisement
The Software Infrastructure
for Electronic Commerce
Databases and Data Mining
Lecture 3:
An Introduction To Data Mining (I)
Johannes Gehrke
johannes@cs.cornell.edu
http://www.cs.cornell.edu/johannes
Lectures Three and Four
• Data preprocessing
• Multidimensional data analysis
• Data mining
• Association rules
• Classification trees
• Clustering
Why Data Preprocessing?
• Quality decisions come from quality data.
• Problems with real life data:
• Data needs to be integrated from different
sources
• Missing values
• Noisy and inconsistent values
• Data is not at the right level of aggregation
Recall: The Relational Data Model
•
•
A relational database is a set of relations
A relation has two components:
1. The relation instance. Basically a table, with rows
(also: records, tuples) and columns (also: fields,
attributes).
Number of records in the relation instance:
Cardinality.
2. The relation schema. Specifies name of the
relation, plus name and type of each column.
•
Turing award (Nobel price in CS) for Codd in
1981 for his work on the relational model
Example: Customer Relation
• Relation schema:
Customers (cid: integer, name: string, byear: integer,
state: string)
• Relation instance:
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
Data Integration
• Integrate data from multiple sources into
a common format for data mining.
• Note: A good data warehouse has already
taken care of this step.
Data Warehouse
Server
OLTP
DBMSs
Other Data
Sources
Extract, clean,
transform, aggregate,
load, update
Data Marts
Data Integration (Contd.)
Problem: Heterogeneous schema integration
• Different attribute names
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
Customer-ID
1
2
3
state
NY
CA
NY
• Different units: Sales in $, sales in Yen, sales in
DM
Data Integration (Contd.)
Problem: Heterogeneous schema integration
• Different scales: Sales in dollars versus sales in
pennies
• Derived attributes: Annual salary versus monthly
salary
cid
1
2
3
monthlySalary
5000
2400
3000
cid
6
7
8
Salary
50,000
100,000
40,000
Data Integration (Contd.)
Problem: Inconsistency due to redundancy
• Customer with customer-id 150 has three children in
relation1 and four children in relation2
cid
1
numChildren
3
cid
1
numChildren
4
• Computation of annual salary from monthly salary in
relation1 does not match “annual-salary” attribute in
relation2
cid
1
2
monthlySalary
5000
6000
cid
1
2
Salary
60,000
80,000
Missing Values
Often the values of some attributes in a
record are not known.
Reasons for missing values:
•
•
•
•
•
Attribute does not apply (e.g., maiden name)
Inconsistency with other recorded data
Equipment malfunction
Human errors
Attribute introduced recently (e.g., email
address)
Missing Values: Approaches
• Ignore the record
• Complete the missing value:
• Manual completion: Tedious and likely to be
infeasible
• Fill in a global constant, e.g., “NULL”,
“unknown”
• Use the attribute mean or mode
• Construct a data mining model that predicts
the missing value
Noisy Data
• Examples:
•
•
•
•
•
•
Faulty data collection instruments
Data entry problems, misspellings
Data transmission problems
Technology limitation
Inconsistency in naming conventions
Duplicate records with different values for a
common field
Noisy Data: Remove Outliers
Noisy Data: Smoothing
y
Y1
Y1’
X1
x
Noisy Data: Normalization
• Scale data to fall within a small, specified
range
•
•
•
•
Leave out extreme order statistics
Min-max normalization
Z-score normalization
Normalization by decimal scaling
Data Reduction
Problem:
• Data might not be at the right scale for
analysis.
Example: Individual phone calls versus
monthly phone call usage
• Complex data mining tasks might run a
very long time.
Example: Multi-terabyte data warehouses
One disk drive: About 20MB/s
Data Reduction: Attribute Selection
• Select the “relevant” attributes for the data
mining task
• If there are k attributes, there are 2k-1 different
subsets
• Example: {salary,children,byear}, {salary,children},
{salary,byear}, {children,byear}, {salary}, {children},
{byear}
• Choice of the right subset depends on:
• Data mining task
• Underlying probability distribution
Attribute Selection (Contd.)
• How to choose relevant attributes:
• Forward selection: Select greedily one
attribute at a time
• Backward elimination: Start with all
attributes, eliminate irrelevant attributes
• Combination of forward selection and
backward elimination
Data Reduction: Parametric Models
• Main idea:
• Fit a parametric model to the data (e.g.,
multivariate normal distribution)
• Store the model parameters, discard the data
(except for outliers)
Parametric Models: Example
• Instead of storing
(x,y) pairs, store only
the x-value.
Then recompute the
y-value using
y = ax + b
y
x
Data Reduction: Sampling
• Choose a representative subset of the
data
• Simple random sampling may have very
poor performance in the presence of skew
Data Reduction: Sampling (Contd.)
Stratified sampling: Biased sampling
• Example: Keep population group ratios
• Example: Keep minority population group count
Data Reduction: Histograms
• Divide data into buckets and store average
(sum) for each bucket
• Can be constructed “optimally” for one attribute
using dynamic programming
• Example:
Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,
7,7,8,8,9,9,10,10,11,11,12,12
Histogram: (range, count, sum)
(1-2,12,16), (3-6,8,36), (7-9,6,48), (10-12,6,66)
Histograms (Contd.)
• Equal-width histogram
• Divides the domain of an attribute into k intervals of
equal size
• Interval width = (Max – Min)/k
• Computationally easy
• Problems with data skew and outliers
• Example:
Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,
7,7,8,8,9,9,10,10,11,11,12,12
Histogram: (range, count, sum)
(1-3,14,22), (4-6,6,30), (7-9,6,48), (10-12,6,66)
Histograms (Contd.)
• Equal-depth histogram
• Divides the domain of an attribute into k intervals,
each containing the same number of records
• Variable interval width
• Computationally easy
• Example:
Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,
7,7,8,8,9,9,10,10,11,11,12,12
Histogram: (range, count, sum)
(1,8,8), (2-4,8,22), (5-8,8,52), (9-12,8,84)
Data Reduction: Discretization
• Same concept as histograms
• Divide domain of a numerical attribute into
intervals.
• Replace attribute value with label for interval.
• Example:
• Dataset (age; salary):
(25;30,000),(30;80,000),(27;50,000),
(60;70,000),(50;55,000),(28;25,000)
• Discretized dataset (age, discretizedSalary):
(25,low),(30,high),(27,medium),(60,high),
(50,medium),(28,low)
Discretization: Natural Hierarchies
• Replace low-level concepts with high-level
concepts
• Example replacements:
• Product SKU by category
• City by state
Industry
Country
Category
State
Product
City
Year
Quarter
Month
Week
Day
Data Reduction: Aggregation
• Natural hierarchies on attributes can be
used to aggregate data along the
hierarchy
Industry
Country
Category
State
Product
City
Year
Quarter
Month
Week
Day
Aggregation: Example
cid
1
1
1
1
1
2
2
2
2
2
3
4
4
Date
3/1/2000
3/5/2000
3/29/2000
4/2/2000
4/6/2000
3/2/2000
3/5/2000
3/10/2000
3/11/2000
4/7/2000
4/2/2000
3/17/2000
4/25/2000
Amount
$150
$50
$200
$300
$200
$100
$250
$100
$50
$200
$300
$250
$100
cid Month Year Amount Visits
1
3
2000 $400
3
1
3
2000 $500
2
2
3
2000 $500
4
2
4
2000 $200
1
3
4
2000 $300
1
4
3
2000 $250
1
4
4
2000 $100
1
Data Reduction: Other Methods
• Principal component analysis
• Fourier transformation
• Wavelet transformation
Data Preprocessing: Summary
• Problems during data
integration:
•
•
•
•
•
Different attribute names
Different units
Different scales
Derived attributes
Redundant data
• Missing values
• Imputation
• Prediction
• Noisy data:
• Outlier removal
• Smoothing
• Normalization
• Data Reduction:
•
•
•
•
•
•
Attribute selection
Fitting parametric models
Sampling
Histograms
Discretization
Aggregation
Data Analysis
• Data preprocessing
• Multidimensional data analysis
• Data mining
• Association rules
• Classification trees
• Clustering
Multidimensional Data Analysis
Recall:
Transactions(ckey, timekey, pkey, units, price)
Customers(ckey, cid, name, byear, city, state, country)
Time(tkey, day, month, quarter, year)
Products(pkey, pname, price, pid, category, industry)
Hierarchies on dimensions:
Industry
Country
Category
State
Product
City
Year
Quarter
Month
Week
Day
Multidimensional Data Analysis
NY
CA
WI
Industry1
$1000
$2000
$1000
Industry2
$500
$1000
$500
Industry3
$3000
$3000
$3000
Industry
Country=“USA”
Category
State
Product
City
Year
Quarter
Month
Week
Day
Corresponding Query in SQL
• SELECT SUM(units)
FROM Transactions T, Products P, Customers C
WHERE T.pkey = P.pkey AND T.ckey = C.ckey
AND C.country = “USA”
GROUP BY P.industry, C.state
• We think that Industry3 in CA is interesting.
Industry
Country=“USA”
Category
State
Product
City
Year
Quarter
Month
Week
Day
Slice and Drill-Down
Category1
San
Francisco
$300
San
Jose
$300
Los
Angeles
$400
Category2
$300
$300
$400
Category3
$100
$800
$100
Industry=“Industry3”
Country
Category
State=“CA”
Product
City
Year
Quarter
Month
Week
Day
Corresponding Query in SQL
• SELECT SUM(units)
FROM Transactions T, Products P, Customers C
WHERE T.pkey = P.pkey AND T.ckey = C.ckey
AND P.industry = “Industry3” AND C.state = “CA”
GROUP BY P.category, C.city
• We think that Category3 is interesting.
Industry=“Industry3”
Country
Category
State=“CA”
Product
City
Year
Quarter
Month
Week
Day
Slice and Drill-Down
Product1
San
Francisco
$20
San
Jose
$160
Los
Angeles
$20
Product2
$20
$160
$20
Product3
$60
$480
$60
Industry
Country
Category=“Category3”
State=“CA”
Product
City
Year
Quarter
Month
Week
Day
Corresponding Query in SQL
• SELECT SUM(units)
FROM Transactions T, Products P, Customers C
WHERE T.pkey = P.pkey AND T.ckey = C.ckey
AND C.state = “CA” AND P.category = “Category3”
GROUP BY P.product, C.city
• Nothing new in this view of the data.
Industry
Country
Category=“Category3”
State=“CA”
Product
City
Year
Quarter
Month
Week
Day
Pivot To (City, Year)
1997
San
Francisco
$20
San
Jose
$100
Los
Angeles
$20
1998
$20
$600
$20
1999
$60
$100
$60
Industry
Country
Category=“Category3”
State=“CA”
Product
City
Year
Quarter
Month
Week
Day
Corresponding Query in SQL
• SELECT SUM(units)
FROM Transactions T, Products P, Customers C
WHERE T.pkey = P.pkey AND T.ckey = C.ckey
AND C.state = “CA” AND P.category = “Category3”
GROUP BY C.city, T.year
Industry
Country
Category=“Category3”
State=“CA”
Product
City
Year
Quarter
Month
Week
Day
Multidimensional Data Analysis
Set of data manipulation operators
• Roll-up: Go up one step in a dimension hierarchy.
Example: month -> quarter
• Drill-down: Go down one step in a dimension
hierarchy. Example: quarter -> month
• Slice: Select a subset of the values of a dimension.
Example: All categories -> only Category3
• Dice: Select all values of a dimension. Example: Only
Category3 -> all categories
• Pivot: Select new dimensions to visualize the data.
Example: Pivot to Time(quarter) and Customer(state)
Visual Intuition: Cube
roll-up to category
Customer Data Mart
roll-up to state
SH
SF
Product
LA
Product1
Product2
Product3
Product4
Product5
Product6
20
30
20
15
10
50
roll-up to week
M T W Th F S S
Time
50 Units of Product6 sold on Monday in LA
OLAP Server Architectures
• Relational OLAP (ROLAP)
• Relational DBMS stores data mart
• OLAP middleware:
• Aggregation and navigation logic
• Optimized for DBMS in the background, but slow and
complex
• Basically only one vendor: Microstrategy
• Multidimensional OLAP (MOLAP)
• Specialized array-based storage structure
• Vendors: Hyperion (Essbase), Appix (iTM1), Oracle,
Microsoft
OLAP Server Architectures
• Desktop OLAP (DOLAP)
• Performs OLAP directly at your PC
• Vendors: Cognos (Powerplay), Business
Objects, Brio Technology, Hummingbird
• Hybrids and Application OLAP
• More: www.olapreport.com
The OLAP Market
Revenues For The OLAP Market
(OLAP Report 11/1998)
3.5
Billion $
3.0
2.5
2.0
Revenue
1.5
1.0
0.5
0.0
1994
1995
1996
1997
1998
1999
2000
Enterprise Reporting Tools
Market Size Forecast for
Enterprise Reporting Tools (IDC 11/1998)
1000
Million $
800
600
Market Size
400
200
0
1997
1998
1999
2000
2001
2002
Summary: Multidimensional Analysis
• Spreadsheet style data analysis
• Roll-up, drill-down, slice, dice, and pivot
your way to interesting cells in the CUBE
• Mainstream technology
• Established enterprises already have OLAP
installations
• When establishing your e-business, OLAP
will be your first step in data analysis
Data Mining
Definition
Data mining is the exploration and analysis
of large quantities of data in order to
discover valid, novel, potentially useful,
and ultimately understandable patterns in
data.
Example pattern (Census Bureau Data):
If (relationship = husband), then (gender = male). 99.6%
Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data
in order to discover valid, novel, potentially useful, and ultimately
understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and
comprehend the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
• Volume and dimensionality of the data
• High data growth rate
Availability of:
•
•
•
•
Data
Storage
Computational power
Off-the-shelf software
An Abundance of Data
•
•
•
•
•
•
•
•
•
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
Web server logs
Customer web site trails
ATM machines
Demographic data
Evolution of Database Technology
• 1960s: IMS and network DBMS
• 1970s: The relational data model, small
relational DBMS implementations
• 1980s: Maturing RDBMS, application-specific
DBMS (spatial data, scientific data, images, etc.)
• 1990s: Mature, high-performance RDBMS
technology, parallel DBMS, terabyte data
warehouses, object-relational DBMS,
middleware and web technology
• 2000s: High availability, maintainability,
seamless integration into business processes
Computational Power
• Moore’s Law: In 1965, Intel Corp.
cofounder Gordon Moore predicted that
the density of transistors in an integrated
circuit would double every year.
• Later changed to reflect 18 months
progress.
• Experts on ants estimate that there are
1016 to 1017 ants on earth. In the year
1997, we produced one transistor per ant.
OTS-Software for Data Mining
•
•
•
•
•
•
•
ANGOSS KnowledgeStudio
IBM Intelligent Miner
Metaputer PolyAnalyst
SAS Enterprise Miner
SGI Mineset
SPSS Clementine
Many others
• More at http://www.kdnuggets.com/software
Why Use Data Mining Today?
And … competitive pressure!
“The secret of success is to know something that
nobody else knows.”
Aristotle Onassis
• Competition on service, not only on price
(Banks, phone companies, hotel chains, rental
car companies)
• Personalization
• Your competitors already apply data mining
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data preprocessing
•
•
•
•
•
Data selection: Identify target datasets and
relevant fields
Data cleaning: Remove noise and outliers
Data transformation
Create common units
Generate new fields
3. Data mining and model evaluation
4. Deployment and measurement of the results
Preprocessing and Mining
Knowledge
Patterns
Target
Data
Preprocessed
Data
Interpretation
Model
Construction
Original Data
Preprocessing
Data
Integration
and Selection
What is a Data Mining Model?
A data mining model is a description of a
specific aspect of a dataset. It produces
output values for an assigned set of input
values.
Examples:
• Linear regression model
• Classification model
• Clustering
Data Mining Models (Contd.)
A data mining model can be described at two
levels:
• Functional level:
• Describes model in terms of its intended usage.
Examples: Classification, clustering
• Representational level:
• Specific representation of a model.
Example: Log-linear model, classification tree,
nearest neighbor method.
• Black-box models versus transparent models
Data Mining Techniques
• Supervised learning
• Classification and regression
• Unsupervised learning
• Clustering
• Dependency modeling
• Associations, summarization, causality
• Outlier and deviation detection
• Trend analysis and change detection
Fields Related to Data Mining
•
•
•
•
•
Database systems
Machine learning
Statistics
Visualization
Parallel computing
Traditional Analysis Tools
• Predefined set of reports
• Capture most important known business questions
• Generated at fixed points in time
• No support for impromptu queries
• SQL: “Write your own SQL query.”
• Statistics department within a company
• Slow reaction to users needs
• Limited understanding of the business questions
• Problems communicating the findings
Example Application: Sports
IBM Advanced Scout analyzes
NBA game statistics
• Shots blocked
• Assists
• Fouls
• http://www.research.ibm.com/scout/home.html
Advanced Scout
• Example pattern: An analysis of the
data from a game played between
the New York Knicks and the Charlotte
Hornets revealed that “When Glenn Rice played
the shooting guard position, he shot 5/6 (83%)
on jump shots."
• Pattern is interesting:
The average shooting percentage for the
Charlotte Hornets during that game was 54%.
Example Application: Sky Survey
• Input data: 3 TB of image data with 2 billion sky
objects, took more than six years to complete
• Goal: Generate a catalog with all objects and
their type
• Method: Use decision trees as data mining
model
• Results:
• 94% accuracy in predicting sky object classes
• Increased number of faint objects classified by 300%
• Helped team of astronomers to discover 16 new high
red-shift quasars in one order of magnitude less
observation time
Gold Nuggets?
• Investment firm mailing list: Discovered that old people do not
respond to IRA mailings
• Bank clustered their customers. One cluster: Older customers, no
mortgage, less likely to have a credit card
• “Bank of 1911”
• Customer churn example
Market Basket Analysis
• Consider shopping cart filled with several
items
• Market basket analysis tries to answer the
following questions:
• Who makes purchases?
• What do customers buy together?
• In what order do customers purchase items?
Market Basket Analysis
Given:
• A database of
customer transactions
• Each transaction is a
set of items
• Example:
Transaction with TID
111 contains items
{Pen, Ink, Milk, Juice}
TID
111
111
111
111
112
112
112
113
113
114
114
114
CID
201
201
201
201
105
105
105
106
106
201
201
201
Date
5/1/99
5/1/99
5/1/99
5/1/99
6/3/99
6/3/99
6/3/99
6/5/99
6/5/99
7/1/99
7/1/99
7/1/99
Item
Pen
Ink
Milk
Juice
Pen
Ink
Milk
Pen
Milk
Pen
Ink
Juice
Qty
2
1
3
6
1
1
1
1
1
2
2
4
Market Basket Analysis (Contd.)
• Coocurrences
• 80% of all customers purchase items X, Y and
Z together.
• Association rules
• 60% of all customers who purchase X and Y
also buy Z.
• Sequential patterns
• 60% of customers who first buy X also
purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association
rules using two interestingness measures:
• Confidence of a rule:
• X => Y has confidence c if P(Y|X) = c
• Support of a rule:
• X => Y has support s if P(XY) = s
We can also define
• Support of an itemset (a coocurrence) XY:
• XY has support s if P(XY) = s
Example
Examples:
• {Pen} => {Milk}
Support: 75%
Confidence: 75%
• {Ink} => {Pen}
Support: 100%
Confidence: 100%
TID
111
111
111
111
112
112
112
113
113
114
114
114
CID
201
201
201
201
105
105
105
106
106
201
201
201
Date
5/1/99
5/1/99
5/1/99
5/1/99
6/3/99
6/3/99
6/3/99
6/5/99
6/5/99
7/1/99
7/1/99
7/1/99
Item
Pen
Ink
Milk
Juice
Pen
Ink
Milk
Pen
Milk
Pen
Ink
Juice
Qty
2
1
3
6
1
1
1
1
1
2
2
4
Exercise
• Find all itemsets with
support >= 75%?
TID
111
111
111
111
112
112
112
113
113
114
114
114
CID
201
201
201
201
105
105
105
106
106
201
201
201
Date
5/1/99
5/1/99
5/1/99
5/1/99
6/3/99
6/3/99
6/3/99
6/5/99
6/5/99
7/1/99
7/1/99
7/1/99
Item
Pen
Ink
Milk
Juice
Pen
Ink
Milk
Pen
Milk
Pen
Ink
Juice
Qty
2
1
3
6
1
1
1
1
1
2
2
4
Exercise
• Can you find all
association rules with
support >= 50%?
TID
111
111
111
111
112
112
112
113
113
114
114
114
CID
201
201
201
201
105
105
105
106
106
201
201
201
Date
5/1/99
5/1/99
5/1/99
5/1/99
6/3/99
6/3/99
6/3/99
6/5/99
6/5/99
7/1/99
7/1/99
7/1/99
Item
Pen
Ink
Milk
Juice
Pen
Ink
Milk
Pen
Milk
Pen
Ink
Juice
Qty
2
1
3
6
1
1
1
1
1
2
2
4
Extensions
• Imposing constraints
•
•
•
•
•
•
•
Only find rules involving the dairy department
Only find rules involving expensive products
Only find “expensive” rules
Only find rules with “whiskey” on the right hand side
Only find rules with “milk” on the left hand side
Hierarchies on the items
Calendars (every Sunday, every 1st of the month)
Market Basket Analysis: Applications
• Sample Applications
•
•
•
•
•
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
Beyond Support and Confidence
Example: 5000 students
• 3000 students play basketball
• 3750 students eat cereal
• 2000 students both play basketball and eat
cereal
Cereal
No cereal
Sum
Basketball No basketball
2000
1750
1000
250
3000
2000
Sum
3750
1250
5000
Misleading Association Rules
Basketball No basketball
Cereal
2000
1750
No cereal
1000
250
Sum
3000
2000
Sum
3750
1250
5000
• Basketball => Cereal (support: 40%,
confidence: 66.7%) is misleading because 75%
of students eat cereal
• Basketball => No cereal (support: 20%,
confidence: 33.3%) is more interesting,
although with lower support and confidence
Interest
Interest of rule A => B: P(AB)/(P(A)*P(B))
• Symmetric (uses both P(A) and P(B))
• Note that confidence is not symmetric
(confidence of rule A => B: P(AB)/P(A))
Interest values:
• Interest = 1: A and B are independent
(P(AB)=P(B)*P(A))
• Interest > 1: A and B are positively correlated
• Interest < 1: A and B are negatively
correlated
Interest: Example
Basketball No basketball
Cereal
2000
1750
No cereal
1000
250
Sum
3000
2000
Itemset
Cereal, basketball
Cereal, no basketball
No cereal, basketball
No cereal, no basketball
Sum
3750
1250
5000
Support Interest
40%
35%
20%
5%
1.125
0.857
0.750
2.000
Extensions
• Imposing constraints
•
•
•
•
•
•
•
Only find rules involving the dairy department
Only find rules involving expensive products
Only find “expensive” rules
Only find rules with “whiskey” on the right hand side
Only find rules with “milk” on the left hand side
Hierarchies on the items
Calendars (every Sunday, every 1st of the month)
Applications
• Sample Applications
•
•
•
•
•
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
• Case study
Summary
• Data preprocessing
• Quality analysis comes from quality data
• Multidimensional data analysis
• Interactive, spreadsheet-style data analysis
• Data mining
• The knowledge discovery process
• Association rules
Questions?
(In the fourth lecture: More data mining.)
Download