Data Mining

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues
Introduction
• Data is growing at a
phenomenal rate
• Users expect more
sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
• Finding hidden information in a
database
• Fit data to a model
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
Database Processing vs. Data Mining Processing
Database processing
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data mining processing
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased
with milk. (association rules)
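The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the `baskets` data and the 0.5 threshold are made up for the example and do not come from the slides.

```python
from collections import Counter

# Hypothetical market baskets (illustrative data only).
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "jam"},
]

# Database-style query: a precise predicate that returns a subset of the data.
milk_baskets = [b for b in baskets if "milk" in b]

# Mining-style question: which items are frequently purchased with milk?
# The answer is derived from the data and is not itself a subset of it.
co_counts = Counter(item for b in milk_baskets for item in b if item != "milk")
min_share = 0.5  # assumed threshold: item appears in at least half of the milk baskets
frequent_with_milk = [
    item for item, n in co_counts.items() if n / len(milk_baskets) >= min_share
]

print(len(milk_baskets))
print(frequent_with_milk)
```

The first result answers "who bought milk"; the second is closer to the association-rule question, where the threshold is a choice the analyst has to make.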
Data Mining Models and Tasks
Basic Data Mining Tasks
• Classification maps data into predefined
groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a
real valued prediction variable.
• Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among
data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential
patterns.
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
Data Mining vs. KDD
• Knowledge Discovery in Databases
(KDD): process of finding useful
information and patterns in data.
• Data Mining: Use of algorithms to extract
the information and patterns derived by
the KDD process.
KDD Process
Modified from [FPSS96C]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format.
Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results
to user in meaningful manner.
KDD Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
• Potential User Applications:
– Cache prediction
– Personalization
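The web log steps above can be sketched roughly as follows. The log layout, the 30-minute session gap, and the helper names `sessionize` and `count_pairs` are all assumptions made for illustration; removal of identifying URLs is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical log entries: (user, timestamp in seconds, page, HTTP status).
log = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 45, "/missing", 404),
    ("u2", 10, "/home", 200),
    ("u2", 50, "/products", 200),
    ("u1", 5000, "/home", 200),
    ("u1", 5020, "/cart", 200),
]

def sessionize(entries, gap=1800):
    """Sort and group each user's successful requests into sessions,
    starting a new session after an idle gap longer than `gap` seconds."""
    by_user = defaultdict(list)
    for user, ts, page, status in sorted(entries, key=lambda e: (e[0], e[1])):
        if status == 200:  # preprocessing: drop error entries
            by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

def count_pairs(sessions):
    """Data mining step: count consecutive page pairs across all sessions."""
    return Counter(pair for s in sessions for pair in zip(s, s[1:]))

sessions = sessionize(log)
pairs = count_pairs(sessions)
print(pairs.most_common(1))
```

Frequently accessed pairs such as these are the kind of pattern a cache-prediction or personalization application could consume.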
Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use
Classification vs. Prediction
• Classification
– predicts categorical class labels (discrete or
nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
• Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical applications
– Credit approval
– Target marketing
March 24, 2016
Data Mining: Concepts and Techniques
Classification—A Two-Step
Process
• Model construction: describing a set of predetermined
classes
– Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of each test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• The test set must be independent of the training set; otherwise
the accuracy estimate is over-optimistic and hides over-fitting
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
The Classifier is applied to the Testing Data and to Unseen Data.

Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data
(Jeff, Professor, 4)   Tenured?
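As a rough sketch of the model-usage step, the rule from the model construction slide can be applied to the testing data to estimate accuracy, and then to the unseen tuple. The function names `classify` and `accuracy` are illustrative choices, not part of the slides.

```python
# Rule from the slides: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Fraction of test samples whose known label matches the classified result."""
    correct = sum(1 for _, rank, years, label in rows if classify(rank, years) == label)
    return correct / len(rows)

print(accuracy(testing_data))    # 0.75 (Merlisa is misclassified as 'yes')
print(classify("Professor", 4))  # yes (the unseen tuple for Jeff)
```

Whether 75% is acceptable depends on the application; the slides leave that threshold open.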
Ex2: Illustrating Classification Task
Induction: Training Set → Learning algorithm → Learn Model → Model
Deduction: Test Set → Apply Model → Class

Training Set
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
Supervised vs. Unsupervised
Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
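To make the unsupervised case concrete, here is a minimal sketch of k-means on one-dimensional values, where no class labels are given and the groups emerge from the data. The function `kmeans_1d`, the sample values, and the starting centers are assumptions for illustration only.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ended up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[1.0, 12.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

In a supervised setting the two groups would have been given as labels; here they are discovered.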
Issues: Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and
handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
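Two of these preparation steps can be sketched as follows. Mean imputation and min-max scaling are just one common choice each, assumed here for illustration; the slides do not prescribe a particular method, and the `incomes` values are made up.

```python
def fill_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [60.0, None, 100.0, 80.0]  # hypothetical attribute with one missing value
cleaned = fill_missing(incomes)
scaled = min_max_normalize(cleaned)
print(cleaned)  # [60.0, 80.0, 100.0, 80.0]
print(scaled)   # [0.0, 0.5, 1.0, 0.5]
```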
Issues: Evaluating Classification Methods
• Accuracy
– classifier accuracy: predicting class label
– predictor accuracy: guessing value of predicted
attributes
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
Related Concepts Outline
Goal: Examine some areas which are related to
data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching
Information Retrieval
• Information Retrieval (IR): retrieving desired
information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
IR Query Result Measures and Classification
[Figure panels: IR, Classification]
Dimensional Modeling
• View data hierarchically, much as business
executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related
attributes; axis for modeling data.
• Facts: data stored
• Ex: Dimensions – products, locations, date
Facts – quantity, unit price
DM: May view data as dimensional.
Relational View of Data
ProdID   LocID        Date     Quantity   UnitPrice
123      Dallas       022900   5          25
123      Houston      020100   10         20
150      Dallas       031500   1          100
150      Dallas       031500   5          95
150      Fort Worth   021000   5          80
150      Chicago      012000   20         75
200      Seattle      030100   5          50
300      Rochester    021500   200        5
500      Bradenton    022000   15         20
500      Chicago      012000   10         25
Dimensional Modeling Queries
• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Decision Support Systems (DSS):
Computer systems and tools to assist
managers in making decisions and
solving problems.
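Roll up and drill down can be illustrated as aggregation at coarser or finer grouping levels. The sketch below uses the rows as read from the Relational View of Data slide; the helper `total_quantity` and the choice of grouping keys are assumptions for illustration, and rolling up here simply drops the location dimension, since the slides do not give a city-to-region hierarchy.

```python
from collections import defaultdict

# (ProdID, LocID, Date, Quantity, UnitPrice) rows from the relational view.
sales = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

def total_quantity(rows, key):
    """Sum Quantity per group, like SQL's SUM(...) with GROUP BY."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill down: more specific grouping, by (product, location).
by_product_location = total_quantity(sales, key=lambda r: (r[0], r[1]))
# Roll up: more general grouping, by product only.
by_product = total_quantity(sales, key=lambda r: r[0])

print(by_product_location[(150, "Dallas")])  # 6
print(by_product)  # {123: 15, 150: 31, 200: 5, 300: 200, 500: 25}
```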
Cube view of Data
Aggregation Hierarchies
Data Warehousing
• “Subject-oriented, integrated, time-variant, nonvolatile”
William Inmon
• Operational Data: Data used in day to day needs of
company.
• Informational Data: Supports other functions such as
planning and forecasting.
• Data mining tools often access data warehouses rather
than operational data.
DM: May access data in warehouse.
Operational vs. Informational
               Operational Data     Data Warehouse
Application    OLTP                 OLAP
Use            Precise Queries      Ad Hoc
Temporal       Snapshot             Historical
Modification   Dynamic              Static
Orientation    Application          Business
Data           Operational Values   Integrated
Size           Gigabits             Terabits
Level          Detailed             Summarized
Access         Often                Less Often
Response       Few Seconds          Minutes
Data Schema    Relational           Star/Snowflake
OLAP
• OnLine Analytical Processing (OLAP): supports more
complex queries than OLTP.
• OnLine Transaction Processing (OLTP): traditional
database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
– Slice: examine sub-cube.
– Dice: rotate cube to look at another dimension.
– Roll Up/Drill Down
DM: May use OLAP queries.
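As a small sketch of the slice operation, a cube can be modeled as a mapping from (product, location, date) coordinates to a measure, and a slice as the sub-cube where one dimension is fixed. The dictionary layout and the `slice_cube` helper are assumptions for illustration; the cell values are quantities taken from a few rows of the relational view.

```python
# Hypothetical cube: (ProdID, LocID, Date) -> total Quantity.
cube = {
    (123, "Dallas", "022900"): 5,
    (123, "Houston", "020100"): 10,
    (150, "Dallas", "031500"): 6,
    (150, "Chicago", "012000"): 20,
}

def slice_cube(cube, axis, value):
    """Slice: keep only the cells whose coordinate on `axis` equals `value`."""
    return {coords: qty for coords, qty in cube.items() if coords[axis] == value}

dallas = slice_cube(cube, axis=1, value="Dallas")
print(dallas)  # {(123, 'Dallas', '022900'): 5, (150, 'Dallas', '031500'): 6}
```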
OLAP Operations
[Figure labels: Roll Up, Drill Down, Single Cell, Multiple Cells, Slice, Dice]