THE US NATIONAL VIRTUAL OBSERVATORY

Basic Concepts in Data Mining
Kirk Borne
George Mason University
2008 NVO Summer School
OUTLINE
• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?
The Scientific Data Flood
[Figure: a large science project's data pipeline feeding the scientific data flood]
The New Face of Science – 1
• Big Data (usually geographically distributed)
– High-Energy Particle Physics
– Astronomy and Space Physics
– Earth Observing System (Remote Sensing)
– Human Genome and Bioinformatics
– Numerical Simulations of any kind
– Digital Libraries (electronic publication repositories)
• e-Science
– Built on the Web Services (e-Gov, e-Biz) paradigm
– Distributed heterogeneous data are the norm
– Data integration across projects & institutions
– One-stop shopping: “The right data, right now.”
The New Face of Science – 2
• Databases enable scientific discovery
– Data Handling and Archiving (management of massive
data resources)
– Data Discovery (finding data wherever they exist)
– Data Access (WWW-Database interfaces)
– Data/Metadata Browsing (serendipity)
– Data Sharing and Reuse (within project teams; and by
other scientists – scientific validation)
– Data Integration (from multiple sources)
– Data Fusion (across multiple modalities & domains)
– Data Mining (KDD = Knowledge Discovery in Databases)
So what is Data Mining?
• Data Mining is Knowledge Discovery in
Databases (KDD)
• Data mining is defined as “an information
extraction activity whose goal is to discover
hidden facts contained in (large) databases.”
• Note: Machine Learning is the field of
Computer Science research that focuses on
algorithms that learn from data.
• Data Mining is the application of Machine
Learning algorithms to large databases.
Scientific Data Mining
Data Mining is the Killer App for Scientific Databases.
• Scientific Data Mining References:
– http://voneural.na.infn.it/
– http://astroweka.sourceforge.net/
– http://www.itsc.uah.edu/f-mass/
• Framework for Mining and Analysis of Space Science data (F-MASS)
• Data mining is used to find patterns and relationships in data.
(EDA = Exploratory Data Analysis)
• Patterns can be analyzed via 2 types of models:
– Descriptive : Describe patterns and create meaningful subgroups
or clusters. (Unsupervised Learning, Clustering)
– Predictive : Forecast explicit values, based upon patterns in
known results. (Supervised Learning, Classification)
• How does this apply to Scientific Research? …
– through KNOWLEDGE DISCOVERY
Data → Information → Knowledge → Understanding / Wisdom!
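As a concrete illustration of the two model types above, here is a minimal sketch, assuming Python with scikit-learn and purely synthetic stand-in data: a descriptive model built by clustering, and a predictive model built by classification.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Synthetic feature matrix: each row is one object, columns are attributes.
X = np.random.rand(1000, 2)

# Descriptive model (Unsupervised Learning): create meaningful subgroups.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Predictive model (Supervised Learning): learn from known labels, then
# forecast the class of a new object. (Here the cluster labels stand in
# for real class labels from a training catalog.)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, clusters)
print(clf.predict([[0.3, 0.7]]))
```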
Astronomy Example
Data:
(a) Imaging data (ones & zeroes)
(b) Spectral data (ones & zeroes)
Information (catalogs / databases):
– Measure brightness of galaxies from images (e.g., 14.2 or 21.7)
– Measure redshift of galaxies from spectra (e.g., 0.0167 or 0.346)
Knowledge:
Hubble Diagram → Redshift–Brightness Correlation → Redshift = Distance
Understanding: the Universe is expanding!!
Astronomers have been doing
Data Mining for centuries
“The data are mine, and
you can’t have them!”
• Seriously ...
• Astronomers love to classify things ...
(Supervised Learning, e.g., classification)
• Astronomers love to characterize things ...
(Unsupervised Learning, e.g., clustering)
• And we love to discover new things ...
(Semi-supervised Learning, e.g., outlier detection)
This sums it up ...
• Characterize the new
(clustering)
• Assign the known
(classification)
• Discover the unknown
(outlier detection)
Graphic from S. G. Djorgovski
• 2 benefits of very large data sets within a scientific domain:
– best statistical analysis of “typical” events
– automated search for “rare” events
Database Systems and Data Mining
• Data mining brings novel non-traditional (Machine Learning)
concepts to large DBMS (e.g., association mining; neural
networks; decision trees; link analysis; pattern recognition;
classification; regression; self-organizing maps). For
example:
– Clustering Analysis = group together similar items, and
separate the dissimilar items
– Classification = predict the class label
– Regression = predict a numeric attribute value
– Association Analysis = detect attribute-value conditions
that occur frequently together
Data Mining Methods and Some Examples
• Clustering = group together similar items and separate dissimilar items in the DB
• Classification = classify new data items using the known classes & groups
• Associations = find unusual co-occurring associations of attribute values among DB items
• Regression Analysis = predict a numeric attribute value
• Self-Organizing Maps (SOM) = organize information in the database based on relationships among key data descriptors
• Link (Affinity) Analysis = identify linkages between data items based on features shared in common
Other methods include: Neural Nets, Decision Trees, Pattern Recognition, Correlation/Trend Analysis, Principal Component Analysis, Independent Component Analysis, Outlier/Glitch Identification, Visualization, and Autonomous Agents.
Some Data Mining Techniques Graphically Represented
[Figure panels: Clustering; Link Analysis; Self-Organizing Map (SOM); Decision Tree; Neural Network; Outlier (Anomaly) Detection]
Categories of Machine Learning
and some Examples
• Supervised Learning
– Classification
• Unsupervised Learning
– Clustering
– Link Analysis
– Association Analysis
• Semi-supervised Learning
– Outlier Detection
– Class Discovery
Some Classification Algorithms
Classification = the process of learning and then applying a function that classifies the data into a set of predefined classes. (A minimal sketch follows this list.)
• Bayes Theorem
• Support Vector Machines (SVM)
• Decision Trees
• Regression
• Neural Networks
• Markov Modeling
• K-Nearest Neighbors
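Here is that sketch: a Bayes-based classifier, assuming Python with scikit-learn and synthetic stand-in data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic training set: two predefined classes in a 2-D attribute space.
X_train = np.vstack([np.random.normal(0, 1, (50, 2)),
                     np.random.normal(4, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

# Learn the classifying function from the labeled examples (Bayes Theorem)...
nb = GaussianNB().fit(X_train, y_train)

# ...then apply it to place new objects into the predefined classes.
print(nb.predict([[0.2, -0.1], [3.9, 4.2]]))  # expected: [0 1]
```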
Classification - a 2-Step Process
1. Model Construction (Description): describing a set of predetermined classes = Build the Model.
– Each data element/tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction = the training set
– The model is represented by classification rules, decision trees, or mathematical formulae
2. Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values = Apply the Model.
– It is important to estimate the accuracy of the model:
• The known labels of the test sample are compared with the classification results from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is chosen completely independently of the training set; otherwise overfitting will occur – and overfitting is a bad thing! (A minimal sketch of this two-step workflow follows below.)
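Here is that sketch, assuming Python with scikit-learn and synthetic stand-in labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for a real training catalog.
X = np.random.rand(500, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Hold out a test set, chosen independently of the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Step 1 -- Model Construction: build the model from the training set.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Step 2 -- Model Usage: apply the model, and estimate its accuracy by
# comparing the known test labels with the model's classifications.
print("accuracy rate:", accuracy_score(y_test, model.predict(X_test)))
```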
Classification Methods:
Decision Trees, Neural Networks,
SVM (Support Vector Machines)
There are 2 classes! How do you ...
– Separate them?
– Distinguish them?
– Learn the rules?
– Classify them?
[Figure: applying a kernel (SVM) maps the data into a space where the two classes become separable]
Some Clustering Algorithms
Clustering = the process of partitioning a set of data into subsets or clusters, such that a data element belonging to a cluster is more similar to data elements belonging to that same cluster than to the data elements belonging to other clusters. (A minimal K-Means sketch follows this list.)
– Squared Error
– Nearest Neighbor
– K-Means (most popular)
– Mixture Models (statistical)
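Here is that sketch, assuming Python with scikit-learn; the two groups of points are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of points in a 2-D attribute space.
X = np.vstack([np.random.normal(0, 0.5, (100, 2)),
               np.random.normal(3, 0.5, (100, 2))])

# Partition the data into k=2 clusters by minimizing the
# squared error to each cluster's centroid.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.cluster_centers_)   # learned cluster centers
print(km.labels_[:10])       # cluster membership of the first 10 points
```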
Clustering is used to discover the distinct groupings (classes) of attribute values.
[Figure: data elements plotted against one attribute] This case is not obvious: one group or two?
This case is easier: there are two groups.
[Figure: the same data elements, now clearly separating into two groups]
(In fact, this is the same set of data elements as shown on the previous slide, but plotted here using a different attribute.)
Semi-supervised Learning:
Outlier Detection and Class Discovery
[Figure: the clustering of data clouds (dc#) within a multidimensional parameter space (p#)]
Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-a-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).
• statistical analysis of “typical” events
• automated search for “rare” events
Outlier Detection: Serendipitous Discovery
of Rare or New Objects & Events
Principal Components Analysis &
Independent Components Analysis
[Figure: Cepheid Variables, Cosmic Yardsticks]
– One Correlation
– Two Classes! ... Class Discovery!
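As a rough illustration of the idea (not the actual Cepheid analysis), here is a minimal PCA sketch, assuming Python with scikit-learn and synthetic correlated data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with one dominant correlation plus noise.
t = np.random.rand(200)
X = np.column_stack([t, 2 * t + np.random.normal(0, 0.05, 200)])

# PCA rotates the data onto orthogonal axes ordered by variance;
# the first component captures the dominant correlation.
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component dominates

# Projecting onto the components can reveal substructure
# (e.g., two classes hiding along one correlation).
X_proj = pca.transform(X)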
Why use Data Mining?
Here are 6 reasons...
1. Most projects now collect massive quantities of data.
2. Because of the enormous potential for new discoveries in
existing huge databases.
3. Data mining moves beyond the analysis of past events to
predicting future trends and behaviors that may be missed
because they lie outside experts’ expectations.
4. Data mining tools can answer complex questions that
traditionally were too time-consuming to resolve.
5. Data mining tools can explore the intricate interdependencies
within databases in order to discover hidden patterns and
relationships.
6. Data mining allows decision-makers to make proactive,
knowledge-driven decisions.
Basic Concepts = Key Steps
• The key steps in a data mining project usually invoke and/or follow these basic concepts (a skeletal code sketch follows the list):
– Data browse, preview, and selection
– Data cleaning and preparation
– Feature selection
– Data normalization and transformation
– Similarity/Distance metric selection
– ... Select the data mining method
– ... Apply the data mining method
– ... Gather and analyze data mining results
– Accuracy estimation
– Avoiding overfitting
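Here is that skeletal sketch, assuming Python with pandas and scikit-learn; the attribute names are illustrative, borrowed from the generic feature vector shown later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Stand-in for browsing/selecting data from a survey database.
rng = np.random.default_rng(42)
df = pd.DataFrame({"mag": rng.normal(18, 2, 500),
                   "redshift": rng.uniform(0, 0.5, 500),
                   "color": rng.normal(0.7, 0.2, 500),
                   "size": rng.lognormal(1, 0.3, 500)})
df.loc[rng.choice(500, 10), "mag"] = np.nan   # inject some missing values

# Data cleaning and preparation: drop records with missing values.
df = df.dropna()

# Feature selection: keep the chosen subset of attributes.
X = df[["mag", "redshift", "color", "size"]]

# Normalization: rescale every attribute to a common 0-1 range.
X_norm = MinMaxScaler().fit_transform(X)

# Select and apply the mining method, then gather the results.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_norm)
print(pd.Series(labels).value_counts())
```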
Key Concept for Data Mining:
Data Previewing
• Data Previewing allows you to get a sense of the good, bad, and ugly parts of the database.
• This includes (see the sketch after this list):
– Histograms of attribute distributions
– Scatter plots of attribute combinations
– Max-Min value checks (versus expectations)
– Summarizations, aggregations (GROUP BY)
– SELECT UNIQUE values (versus expectations)
– Checking physical units (and scale factors)
– External checks (cross-DB comparisons)
– Verify with input DB
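Here is that sketch, assuming Python with pandas; the catalog and its column names are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Stand-in catalog (hypothetical attribute names).
rng = np.random.default_rng(0)
df = pd.DataFrame({"mag": rng.normal(18, 2, 1000),
                   "color": rng.normal(0.7, 0.2, 1000),
                   "class": rng.choice(["star", "galaxy"], 1000)})

# Max-Min value checks and summary statistics (versus expectations).
print(df.describe())

# Histogram of one attribute's distribution.
df["mag"].hist(bins=50)

# Scatter plot of an attribute combination.
df.plot.scatter(x="color", y="mag")

# Unique values and aggregations: the SELECT UNIQUE / GROUP BY checks.
print(df["class"].unique())
print(df.groupby("class")["mag"].agg(["min", "max", "mean"]))
```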
Key Concept for Data Mining:
Data Preparation = Cleaning the Data
• Data Preparation can take 40-80% (or more) of the effort in a data mining project.
• This includes (see the sketch after this list):
– Dealing with NULL (missing) values
– Dealing with errors
– Dealing with noise
– Dealing with outliers (unless that is your science!)
– Transformations: units, scale, projections
– Data normalization
– Relevance analysis: Feature Selection
– Removing redundant attributes
– Dimensionality Reduction
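Here is that sketch, assuming Python with pandas; the catalog, its injected defects, and the 5-sigma cut are illustrative choices.

```python
import numpy as np
import pandas as pd

# Stand-in catalog with some typical defects injected.
rng = np.random.default_rng(1)
df = pd.DataFrame({"mag": rng.normal(18, 2, 1000)})
df.loc[rng.choice(1000, 20), "mag"] = np.nan   # missing values
df.loc[rng.choice(1000, 5), "mag"] = 999.0     # bogus sentinel values

# Dealing with NULL (missing) values: fill with the median (or drop).
df["mag"] = df["mag"].fillna(df["mag"].median())

# Dealing with errors: remove physically impossible magnitudes.
df = df[df["mag"].between(-30, 40)]

# Dealing with outliers -- unless that is your science!
mu, sigma = df["mag"].mean(), df["mag"].std()
df = df[(df["mag"] - mu).abs() < 5 * sigma]
print(len(df), "records after cleaning")
```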
Key Concept for Data Mining:
Feature Selection – the Feature Vector
• A feature vector is the attribute vector for a database record (tuple).
• The feature vector’s components are database attributes: v = {w, x, y, z}
• It contains the set of database attributes that you have chosen to uniquely represent (describe) each data element (tuple).
– This is only a subset of all possible attributes in the DB.
• Example: Sky Survey database object feature vector (see the sketch below):
– Generic: {RA, Dec, mag, redshift, color, size}
– Specific: {ra2000, dec2000, r, z, g-r, R_eff}
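Here is that sketch, assuming Python; the record's values are invented, but the attribute names follow the slide's specific example.

```python
import numpy as np

# One database record (tuple) from a hypothetical sky survey catalog.
record = {"ra2000": 187.70, "dec2000": 12.39, "r": 14.2,
          "z": 0.0167, "g-r": 0.71, "R_eff": 5.3, "obs_date": "2008-06-01"}

# The feature vector keeps only the chosen subset of attributes.
FEATURES = ["ra2000", "dec2000", "r", "z", "g-r", "R_eff"]
v = np.array([record[f] for f in FEATURES])
print(v)
```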
Key Concept for Data Mining:
Data Types
• Different data types:
– Continuous:
• Numeric (e.g., salaries, ages, temperatures, rainfall, sales)
– Discrete:
• Binary (0 or 1; Yes/No; Male/Female)
• Boolean (True/False)
• Specific list of allowed values (e.g., zip codes; country names; chemical
elements; amino acids; planets)
– Categorical:
• Non-numeric (character/text data) (e.g., people’s names)
• Can be Ordinal (ordered) or Nominal (not ordered)
• Reference: http://www.twocrows.com/glossary.htm#anchor311516
• Examples of Data Mining Classification Techniques:
– Regression for continuous numeric data
– Logistic Regression for discrete data
– Bayesian Classification for categorical data
Key Concept for Data Mining:
Data Normalization & Data Transformation
• Data Normalization transforms data values for different database attributes into a uniform set of units or onto a uniform scale (i.e., to a common min-max range).
• Data Normalization assigns the correct numerical weighting to the values of different attributes.
• For example (see the sketch after this list):
– Transform all numerical values from min to max on a 0 to 1 scale (or 0 to Weight; or -1 to 1; or 0 to 100; ...).
– Convert discrete or character (categorical) data into numeric values.
– Transform ordinal data to a ranked list (numeric).
– Discretize continuous data into bins.
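Here is that sketch, assuming Python with NumPy.

```python
import numpy as np

def min_max_normalize(x, lo=0.0, hi=1.0):
    """Rescale values linearly from [min(x), max(x)] to [lo, hi]."""
    x = np.asarray(x, dtype=float)
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

mags = np.array([14.2, 16.5, 21.7, 18.0])
print(min_max_normalize(mags))          # 0 to 1 scale
print(min_max_normalize(mags, -1, 1))   # -1 to 1 scale
```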
Key Concept for Data Mining:
Similarity and Distance Metrics
• Similarity between complex data objects is
one of the central notions in data mining.
• The fundamental problem is to determine
whether any selected pair of data objects
exhibit similar characteristics.
• The problem is both interesting and
difficult because the similarity measures
should allow for imprecise matches.
• Similarity and its inverse – Distance –
provide the basis for all of the fundamental
data mining clustering techniques and for
many data mining classification techniques.
Similarity and Distance Measures
• Most clustering algorithms depend on a distance or similarity
measure, to determine (a) the closeness or “alikeness” of cluster
members, and (b) the distance or “unlikeness” of members from
different clusters.
• General requirements for any similarity or distance metric:
– Non-negative: dist(A,B) ≥ 0 and sim(A,B) ≥ 0
– Symmetric: dist(A,B) = dist(B,A) and sim(A,B) = sim(B,A)
• In order to calculate the “distance” between different attribute
values, those attributes must be transformed or normalized
(either to the same units, or else normalized to a similar scale).
• The normalization of both categorical (non-numeric) data and
numerical data with units generally requires domain expertise.
This is part of the pre-processing (data preparation) step in any
data mining activity.
Popular Similarity and Distance Measures
• General Lp distance: ||x-y||_p = [ sum_i |x_i - y_i|^p ]^(1/p)
• Euclidean distance (p=2):
– D_E = sqrt[ (x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 + ... ]
• Manhattan distance (p=1; the # of city blocks walked):
– D_M = |x1-y1| + |x2-y2| + |x3-y3| + ...
• Cosine distance = the angle between two feature vectors:
– d(X,Y) = arccos[ X·Y / (||X|| ||Y||) ]
– d(X,Y) = arccos[ (x1y1 + x2y2 + x3y3 + ...) / (||X|| ||Y||) ]
• Similarity function: s(x,y) = 1 / [1 + d(x,y)]
– s varies from 1 to 0 as the distance d varies from 0 to ∞. (A code sketch of these measures follows below.)
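Here is that sketch, assuming Python with NumPy.

```python
import numpy as np

def lp_distance(x, y, p=2):
    """General Lp (Minkowski) distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    """Angle (in radians) between two feature vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def similarity(x, y, p=2):
    """s = 1 / (1 + d): equals 1 at zero distance, tends to 0 as d grows."""
    return 1.0 / (1.0 + lp_distance(x, y, p))

a, b = [0, 0, 0], [1, 2, 2]
print(lp_distance(a, b, p=2))  # Euclidean: 3.0
print(lp_distance(a, b, p=1))  # Manhattan: 5.0
print(similarity(a, b))        # 0.25
```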
Data Mining Clustering and
Nearest Neighbor Algorithms – Issues
• Clustering algorithms and nearest neighbor algorithms (for classification) require a distance or similarity metric.
• You must be especially careful with categorical data, which can be a problem. For example:
– What is the distance between blue and green? Is it larger than the distance from green to red?
– How do you “metrify” different attributes (color, shape, text labels)? This is essential in order to calculate distances in multiple dimensions. Is the distance from blue to green larger or smaller than the distance from round to square? Which of these are most similar?
Key Concept for Data Mining:
Classification Accuracy
Typical Error Matrix (rows = classifier output; columns = actual classes in the training data):

                       TRAINING DATA (actual classes)
                       Class-A      Class-B      Totals
Classified as Class-A  2834 (TP)    173 (FP)     3007
Classified as Class-B  318 (FN)     3103 (TN)    3421
Totals                 3152         3276         6428

(TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative)
Typical Measures of Accuracy
• Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Producer’s Accuracy (Class A) = TP/(TP+FN)
• Producer’s Accuracy (Class B) = TN/(FP+TN)
• User’s Accuracy (Class A) = TP/(TP+FP)
• User’s Accuracy (Class B) = TN/(TN+FN)

Accuracy of our Classification on the preceding slide:
• Overall Accuracy = 92.4%
• Producer’s Accuracy (Class A) = 89.9%
• Producer’s Accuracy (Class B) = 94.7%
• User’s Accuracy (Class A) = 94.2%
• User’s Accuracy (Class B) = 90.7%
(These values are checked in the short computation below.)
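A quick arithmetic check against the error matrix, assuming Python.

```python
TP, FP, FN, TN = 2834, 173, 318, 3103

print("Overall accuracy:", (TP + TN) / (TP + TN + FP + FN))  # ~0.924
print("Producer's accuracy (A):", TP / (TP + FN))            # ~0.899
print("Producer's accuracy (B):", TN / (FP + TN))            # ~0.947
print("User's accuracy (A):", TP / (TP + FP))                # ~0.942
print("User's accuracy (B):", TN / (TN + FN))                # ~0.907
```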
Key Concept for Data Mining:
Overfitting
[Figure: the same data points fit by three curves: g(x), h(x), and d(x)]
• g(x) is a poor fit (fitting a straight line through the points)
• h(x) is a good fit
• d(x) is a very poor fit (fitting every point) = Overfitting
How to Avoid Overfitting in Data Mining Models
• In Data Mining, the problem arises because you are training the model on a set of training data (i.e., a subset of the total database).
• That training data set is simply intended to be representative of the entire database, not a precise, exact copy of it.
• So, if you try to fit every nuance in the training data, then you will probably over-constrain the problem and produce a bad fit.
• This is where a TEST DATA SET comes in very handy. You can train the data mining model (Decision Tree or Neural Network) on the TRAINING DATA, and then measure its accuracy with the TEST DATA, prior to unleashing the model (e.g., Classifier) on real new data.
• Different ways of subsetting the TRAINING and TEST data sets (see the sketch after this list):
– 50-50 (50% of the data used to TRAIN, 50% used to TEST)
– 10 different sets of 90-10 (90% for TRAINING, 10% for TESTING)
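The second scheme is essentially 10-fold cross-validation; here is the sketch, assuming Python with scikit-learn and synthetic stand-in data.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(1000, 4)           # synthetic stand-in data
y = (X[:, 0] > 0.5).astype(int)

# 10 different 90-10 splits: train on 90%, test on the held-out 10%.
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("mean test accuracy:", np.mean(scores))
```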
Schematic Approach to Avoiding Overfitting
[Figure: Training Set error and Test Set error plotted versus Training Epoch; the Test Set error curve turns upward at the point marked "STOP Training HERE!"]
To avoid overfitting, you need to know when to stop training the model. Although the Training Set error may continue to decrease, you may simply be overfitting the Training Data. Test this by applying the model to Test Data (not part of the Training Set). If the Test Set error starts to increase, then you know that you are overfitting the Training Set and it is time to stop! (A code sketch of this stopping rule follows below.)
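Here is that sketch, assuming Python; train_step and error are hypothetical placeholders for whatever model and error measure are being used.

```python
def train_with_early_stopping(model, train_data, test_data,
                              max_epochs=1000, patience=5):
    """Stop training once the test-set error starts to increase."""
    best_error = float("inf")
    epochs_since_best = 0
    for epoch in range(max_epochs):
        model.train_step(train_data)          # hypothetical: one training epoch
        test_error = model.error(test_data)   # hypothetical: test-set error
        if test_error < best_error:
            best_error = test_error
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # error keeps rising: STOP HERE!
                break
    return model
```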
Scientific Data Mining in Astronomy