Uploaded by Mary Scheidel

Data Mining Basics: CISC4631 Presentation

Department of Computer and Information Science
Data Mining Basics
CISC4631 Data Mining
L1 U2
Outline
 From Data to Information
 What is NOT Data Mining?
 A KDD Process
 Data Mining Techniques
 Data Mining: Miscellaneous
From data to information
 Data vs. Information? (Do you agree there are differences?)
 Data is a collection of facts, while information puts those facts into context.
 Data is raw and unorganized, while information is organized.
From data to information
 Information is crucial
 Example 1: Covid-19

    Given: coronavirus exposure described by 30 or more features
    Problem: identify the virus-strength and patient-immunity factors that may prolong the impact on the patient
    Data: historical records of patients and outcomes
 Example 2: Inventing a new product
    Given: market attraction described by 20 or more requirements
    Problem: perform market analysis to identify new product bundles
    Data: historical records of market behavior and manufacturing issues, used to profile customers more accurately
 Example 3: Medicine
    Given: the patient's information
    Problem: provide an accurate diagnosis
    Data: historical medical records, physical examinations, and treatment patterns
 Society produces huge amounts of data
 Sources: social, business, science, medicine, economics, geography, environment, sports, …
“All scientists are data scientists”
- Monica Rogati, Senior Research Scientist @LinkedIn
Search Engines
 Early search engines relied mainly on the keywords on a page, which made them easy to manipulate
 Google's success is due to its algorithm, which relies mainly on links to the page
 Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google
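The link-based idea can be illustrated with a small sketch. The slide does not name the algorithm, so the code below assumes a PageRank-style power iteration. The four-page graph, the damping factor of 0.85, and the function name `link_rank` are illustrative assumptions, not details from the lecture.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Rank pages by their incoming links using power iteration.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        # Each page passes its score, split evenly, to the pages it links to.
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: page C receives links from A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = link_rank(toy_web)
best_page = max(ranks, key=ranks.get)
print(best_page, ranks)
```

A page scores highly when other well-linked pages point to it, which is harder to manipulate than stuffing keywords into the page itself.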
o Growth Trends
 Moore’s law
    Computer speed doubles every 18 months
 Storage law
    Total storage doubles every 9 months
o Consequence
    Very little data will ever be looked at by a human
o Knowledge discovery is NEEDED to make sense of and use data.
o Data mining may help scientists classify and segment data during hypothesis formation
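A short calculation shows how quickly the two growth rates diverge. This is a minimal sketch that takes the doubling periods exactly as stated on the slide. The three-year horizon and the helper name `growth_factor` are assumptions chosen for illustration.

```python
def growth_factor(months, doubling_period_months):
    """Return the multiplicative growth after `months` of steady doubling."""
    return 2 ** (months / doubling_period_months)


months = 12 * 3  # assumed horizon: three years

speed_growth = growth_factor(months, 18)    # computer speed, doubling every 18 months
storage_growth = growth_factor(months, 9)   # total storage, doubling every 9 months

print(speed_growth, storage_growth, storage_growth / speed_growth)
```

Under these assumptions storage grows by the square of the growth in speed over any period, so the gap between what is stored and what can be processed or read keeps widening.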
http://www.intelfreepress.com/news/3d-xpoint-memory-storage/9790/
Data Mining vs. Machine Learning
 DM finds/extracts information from data to reach a conclusion
 It provides insights and information, or enables fast and accurate decision-making
 Strong, accurate patterns are needed to make a decision
    Problem 1: most patterns are not interesting
    Problem 2: patterns may be inexact
    Problem 3: data may be garbled or missing
 ML identifies patterns (e.g., a model) in data and provides many tools for data mining
 ML provides structural descriptions
 Structural descriptions
 Patterns that are found may be represented as structural descriptions or as black-box models
 Example: if-then rules
    If tear production rate = reduced
        then recommendation = none
    Otherwise, if age = young and astigmatic = no
        then recommendation = soft

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
…               …                       …            …                     …
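The two if-then rules can be written directly as code, which makes the "structural description" concrete. This is a sketch only: the function name `recommend_lenses` and the `NOT_COVERED` marker are assumptions, and the remaining rules of the full rule set are not listed on the slide, so they are not guessed here.

```python
NOT_COVERED = "not covered"  # hypothetical marker for cases the two rules do not handle


def recommend_lenses(age, astigmatic, tear_production_rate):
    """Apply the two rules from the slide in order; the first match wins."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return NOT_COVERED


# The four table rows shown on the slide: (age, astigmatic, tear rate, recommended lenses).
rows = [
    ("young", "no", "reduced", "none"),
    ("young", "no", "normal", "soft"),
    ("pre-presbyopic", "no", "reduced", "none"),
    ("presbyopic", "yes", "normal", "hard"),
]
results = [recommend_lenses(age, astig, tear) for age, astig, tear, _ in rows]
print(results)
```

The first three rows are reproduced by the two rules. The fourth row (hard lenses) needs a rule that the slide does not show, so it falls through to the marker value.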
The weather problem
 Conditions for playing a certain game
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
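The rules above form an ordered decision list: they are tried from top to bottom and the first one that matches decides the outcome. A minimal sketch follows, in which the function name `play` and the lowercase string encoding of the attribute values are assumptions.

```python
def play(outlook, humidity, windy):
    """Evaluate the slide's rules in order; the first matching rule decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "none of the above" default rule


# The four rows shown in the table: (outlook, humidity, windy, play).
table_rows = [
    ("sunny", "high", False, "no"),
    ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),
]
predictions = [play(outlook, humidity, windy) for outlook, humidity, windy, _ in table_rows]
print(predictions)
```

Order matters here: a rainy, windy day with normal humidity is classified as "no" by the second rule before the humidity rule is ever reached.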
 Definitions of “learning” from dictionary:
    To get knowledge of by study, experience, or being taught
    To become aware by information or from observation
        (Difficult to measure)
    To commit to memory
    To be informed of, ascertain; to receive instruction
        (Trivial for computers)
• Operational definition:
    Things learn when they change their behavior in a way that makes them perform better in the future.
Does a slipper learn?
http://www.differencebetween.net/technology/difference-between-data-mining-and-machine-learning/
What is (not) Data Mining?
o What is not
 Look up a phone number in a phone directory
 Query a Web search engine for information about “Amazon”
 Total number of books sold in a day
 Selection, interpretation
o What is
 Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
 Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
 Which items sell more on a particular day than on other days
 Clustering, analysis, characterization, discrimination
Data Mining: A KDD Process
Learning the application domain: relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining: summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
[Figure: KDD process. Databases → Data Cleaning, Data Integration → Data Warehouse → Data Selection, Data Preprocessing → Task-relevant Data → Data Mining → Pattern Evaluation → Understanding]
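The KDD steps can be sketched as a chain of small functions, one per stage. Everything in this example is an illustrative assumption: the toy records, the humidity threshold of 80, the minimum count of 2, and the function names. The "mining" step is a simple frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"id": 1, "outlook": "sunny", "humidity": 85, "play": "no"},
    {"id": 2, "outlook": "sunny", "humidity": 90, "play": "no"},
    {"id": 3, "outlook": "overcast", "humidity": None, "play": "yes"},
    {"id": 4, "outlook": "rainy", "humidity": 70, "play": "yes"},
    {"id": 5, "outlook": "sunny", "humidity": 95, "play": "no"},
]


def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{field: record[field] for field in fields} for record in records]


def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(value is not None for value in r.values())]


def transform(records, threshold=80):
    """Data transformation: discretize humidity into high/normal."""
    return [
        {**r, "humidity": "high" if r["humidity"] > threshold else "normal"}
        for r in records
    ]


def mine(records):
    """Data mining: count how often each attribute combination occurs."""
    return Counter((r["outlook"], r["humidity"], r["play"]) for r in records)


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns seen at least `min_count` times."""
    return {pattern: count for pattern, count in patterns.items() if count >= min_count}


target = select(raw_records, ["outlook", "humidity", "play"])
cleaned = clean(target)
reduced = transform(cleaned)
patterns = evaluate(mine(reduced))
print(patterns)
```

Each stage consumes the output of the previous one, which mirrors the bottom-to-top flow of the KDD diagram.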
Data Mining Techniques
o Common data mining techniques or tasks
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]
Classification
o Databases to be mined
o Knowledge to be mined
o Techniques to be utilized
o Applications to be adapted
Classification: Definition
o A data mining task that assigns items in a collection to target categories or classes.
o Goal:
 To accurately predict the target class for each case in the data.
 Previously unseen records should be assigned a class as accurately as possible.
Classification: Definition
o Training vs. Test set
 Training set
    A subset used to train a model
 Test set
    A subset used to test the trained model
    Used to determine the accuracy of the model.
o Classifier
    A model that assigns items in a collection to target classes
o Classification rule:
predicts value of a given attribute (the classification of an example)
If outlook = sunny and humidity = high
then play = no
Classification Example
Training Set (used to discover potentially predictive relationships)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (used to assess the strength and utility of a predictive relationship)

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
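The workflow in this example can be sketched in code. The model below is a hand-written decision tree that happens to agree with all ten training records; it is an assumption standing in for whatever a learning algorithm would produce, and the slide does not show the learned model. Because the class of the unseen records is unknown ("?"), accuracy is computed only on the labelled records, which gives an optimistic estimate; a real evaluation would hold out labelled records as the test set.

```python
# Labelled records from the slide: (refund, marital_status, income_k, cheat).
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unseen records from the slide; their class is unknown.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def classify(refund, marital_status, income_k):
    """Hand-written tree (illustrative assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"


def accuracy(model, labelled_records):
    """Fraction of labelled records whose predicted class matches the label."""
    correct = sum(model(*record[:3]) == record[3] for record in labelled_records)
    return correct / len(labelled_records)


training_accuracy = accuracy(classify, training_set)
predictions = [classify(*record) for record in test_set]
print(training_accuracy, predictions)
```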
Classification: Application 1
o Direct Marketing
 Goal
    Reduce the cost of mailing by targeting a set of consumers likely to buy.
 Approach:
    Use data from similar products
    The {buy, don’t buy} decision forms the class attribute
    Collect various demographic and lifestyle information
    Type of business, social status
Classification: Application 2
Classifying Galaxies
Class:
• Stages of formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Clustering: Definition
o Find a “natural” grouping of instances given unlabeled data
o Finding a similarity measure
 Data points in one cluster are more similar to one another.
 Data points in separate clusters are less similar to one another.
    Euclidean distance
    Other problem-specific measures.
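Euclidean distance is the straight-line distance between two points, computed as the square root of the summed squared differences of their coordinates. The sketch below uses made-up two-dimensional points; the cluster contents and the name `euclidean_distance` are assumptions for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical points forming two well-separated groups.
cluster_a = [(1.0, 1.0), (1.0, 2.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0)]

within = euclidean_distance(cluster_a[0], cluster_a[1])   # same cluster
between = euclidean_distance(cluster_a[0], cluster_b[0])  # different clusters
print(within, between)
```

A good clustering keeps the within-cluster distances small relative to the between-cluster distances, as in this toy data.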
o Market Segmentation:
 Goal
    Subdividing a market into distinct subsets of customers
 Approach:
    Collect different attributes of customers (geographical, lifestyle, business)
    Find clusters of similar customers
    Measure the clustering quality by observing buying patterns
o Document Clustering:
 Goal
    To find groups of documents that are similar to each other
 Approach
    Identify frequently occurring terms
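One common way to turn "frequently occurring terms" into a similarity score is to count the terms in each document and compare the count vectors. The slide does not specify a measure, so cosine similarity is an assumption here, as are the three toy documents and the function names.

```python
import math
from collections import Counter


def term_counts(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical search results for the query "Amazon".
rainforest_1 = term_counts("amazon rainforest river trees")
rainforest_2 = term_counts("amazon river rainforest wildlife")
store_1 = term_counts("amazon online store books")

same_topic = cosine_similarity(rainforest_1, rainforest_2)
different_topic = cosine_similarity(rainforest_1, store_1)
print(same_topic, different_topic)
```

Documents that share many terms score close to 1 and would be placed in the same cluster, while documents that share only the query word score much lower.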
Association Rule Discovery
o Given a set of records, each of which contains some number of items from a given collection
 Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
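The strength of such rules is usually measured by support (the fraction of records containing all items in the rule) and confidence (among records containing the left-hand side, the fraction that also contain the right-hand side). These two measures are standard but are not named on the slide, so treat them as an assumption; the function names are illustrative. The transactions are the five records from the table.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= transaction for transaction in transactions)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction with the consequent."""
    return count(antecedent | consequent) / count(antecedent)


milk_coke = (support({"Milk"}, {"Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_beer = (
    support({"Diaper", "Milk"}, {"Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)
print(milk_coke, diaper_milk_beer)
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 records that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 records that contain both Diaper and Milk.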
Regression
o Predict the value of a given continuous-valued variable based on the values of other variables
 Extensively studied in the statistics and neural network fields.
o Examples:
 Predicting sales amounts
 Predicting wind velocities
 Time series prediction of stock market indices
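The simplest case is fitting a straight line to one input variable by ordinary least squares. The sketch below is a minimal illustration; the advertising and sales figures are invented, and the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend versus sales amount.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen input
print(slope, intercept, predicted_sales)
```

Unlike classification, the predicted value is a number on a continuous scale rather than one of a fixed set of classes.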
Large-scale Endeavors
[Table: data mining products compared by supported task]
Products: SAS, SPSS, Oracle (Darwin), IBM, DBMiner (Simon Fraser), Weka, Colab
Tasks: Clustering, Classification, Association, Sequence, Deviation
Techniques named in the table: Decision Trees, ANN, Time Series
Especially for ML
Next
o Data Warehouse