Data Mining

Chris Williams
Institute for Adaptive and Neural Computation
Division of Informatics
University of Edinburgh, UK
24 October 2002

Overview
• Key idea: Finding Structure in Data
• The Data Mining Process
  – Predictive Modelling
  – Descriptive Modelling
• Scientific Data Mining
• Probabilistic Modelling

What is data mining?
• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.” (Hand, Mannila, Smyth)
• “[Data mining is the] extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.” (Han)
• “We are drowning in data, but starving for knowledge!” (Han)

The Data Mining Process
[Figure from Han and Kamber: Databases and flat files → Cleaning and Integration → Data warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation and Presentation → Knowledge]

Models and Patterns
• A model structure is a global summary of the data set.
  Example: linear regression, which makes a prediction for all input values.
• Pattern structures make statements only about restricted regions of the space spanned by the variables.
  Example: [formula not recoverable from the source]. Equivalently: [formula not recoverable from the source].
  Example: detection of outliers.
• Definition from Hand, Mannila, Smyth (2001)

Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modelling
  – Probability density estimation
  – Cluster analysis/segmentation
• Predictive Modelling: Classification and Regression
• Discovering Patterns
  – Association rules
  – Outlier detection
• Mining Complex Types of Data
  – Retrieval by Content (RBC) for text, images
  – Time series and sequence data
  – Spatial data
  – Text mining
  – Mining the WWW (content, structure, usage)

Predictive Modelling
• Learning from input-output pairs
• Predict output(s) given inputs
• Classification and regression problems
• Examples
  – SKICAT (JPL/Caltech): predict if an astronomical object is a star or a galaxy
  – Predicting disease presence/absence based on gene expression data
  – PapNet: searching for abnormal cells in Pap smears

Supervised learning
• Always inductive (from the specific to the general), so an inductive bias is needed
• Learning as search: usually we have a class of functions, and wish to find suitable candidates within the class
• Key issue is generalization: the ability to make predictions for new inputs
• Example methods
  – Neural Networks
  – Decision Trees
  – Nearest-neighbour methods
  – Support Vector Machines
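
To make the idea of generalization concrete, here is a minimal sketch of one of the listed methods, a 1-nearest-neighbour classifier, evaluated on inputs it has not seen during training. Everything in it is an assumption made for illustration: the two synthetic Gaussian classes, the labels "star" and "galaxy" (borrowed from the SKICAT example as names only, not real SKICAT data), the 150/50 train/test split, and the helper names.

```python
import random


def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    _, label = min(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    return label


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample(centre, label, n):
    """Draw n synthetic 2-d inputs around (centre, centre) with the given label."""
    return [((rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)), label)
            for _ in range(n)]


# Synthetic input-output pairs; the labels are illustrative only.
data = sample(0.0, "star", 100) + sample(3.0, "galaxy", 100)
rng.shuffle(data)

# Hold out part of the data to estimate performance on new inputs.
train, test = data[:150], data[150:]
correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The held-out accuracy, not the accuracy on the training pairs, is the quantity that speaks to generalization. A 1-NN classifier is always perfect on its own training set, which says nothing about new inputs.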

Descriptive Modelling
• Task is to discover significant patterns or features in the input data, without a teacher; self-organization
• No external teacher or critic, but often an internal quality measure is optimized
• Some lower-dimensional structures in a higher-dimensional space, e.g.
  – Cluster centres (points in 0-d)
    [Figure: scatter of points with cluster centres marked X]
  – Lower-dimensional manifolds, e.g. lines, sheets (1-d, 2-d)
    [Figure: scatter of points near a lower-dimensional manifold]
• Examples
  – Clustering
  – Dimensionality reduction/fitting a lower-dimensional manifold
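
As an illustration of clustering driven by an internal quality measure, the following is a minimal k-means sketch. k-means is one common choice, not a method the slides name. The quality measure here is the within-cluster sum of squared distances, and the synthetic two-blob data, the function names and the parameter defaults are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: returns (cluster centres, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise centres at k data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            clusters[j].append(p)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        new_centres = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    # Internal quality measure: no teacher or labels are involved.
    cost = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, cost


# Synthetic data: two well-separated 2-d blobs (illustrative only).
rng = random.Random(1)
blob_a = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(50)]
blob_b = [(rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)) for _ in range(50)]
centres, cost = kmeans(blob_a + blob_b, k=2)
print("centres:", centres)
print("within-cluster sum of squares:", round(cost, 3))
```

The two recovered centres play the role of the points marked X in the cluster-centre figure: a 0-d summary of where the data concentrate.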

Computational Environments
• Data mining is greatly facilitated by having a computational environment in which many operations can be carried out, pipelined, evaluated and visualized, etc.
• Some data mining examples: WEKA, Clementine
• Some general scientific computational environments, e.g. MATLAB, IDL
• How useful are general-purpose data mining tools for SDM?

Scientific Data Mining (SDM)
• Sometimes standard data mining techniques will be enough, e.g. SKICAT, Clustering of Regimes in the Earth’s Upper Atmosphere
• Domain knowledge can be introduced by the choice and construction of appropriate features, if these are known
• However, many standard machine learning methods encode only vague notions of prior knowledge
• To what extent do we need to incorporate domain knowledge in SDM?

Clustering of regimes in the Earth’s Upper Atmosphere
• Smyth, Ghil, Ide (1998)
• Observations of geopotential height on a spatial grid of 500 points recorded twice daily since 1948
• Reduced using PCA, then clustered
• Showed good agreement with three well-known maps in atmospheric science

Probabilistic Modelling
• A tool for modelling (possibly) complex networks of relationships which are non-deterministic
• A directed graphical structure. Joint probability is defined as
  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}_i)$, where $\mathrm{pa}_i$ denotes the parents of $x_i$ in the graph
• Use of probability theory as a calculus of uncertainty
• Graphical descriptions are used to define (in)dependence
• Probabilistic expert systems (can deal comfortably with exceptions), cf. rule-based systems
• Probabilistic expert systems for medical diagnosis (e.g. QMR-DT, MUNIN)
• Can learn probability models from data
• A framework for dealing with hidden-cause (latent variable) models
• Association for Uncertainty in AI

Example 1: Does my car start?
Heckerman (1995)
Variables: Battery (b), Fuel (f), Gauge (g), Turn Over (t), Start (s)
[Figure: directed graph with edges Battery → Gauge, Fuel → Gauge, Battery → Turn Over, Turn Over → Start, Fuel → Start]

P(b=bad) = 0.02
P(f=empty) = 0.05

P(g=empty | b=good, f=not empty) = 0.04
P(g=empty | b=good, f=empty) = 0.97
P(g=empty | b=bad, f=not empty) = 0.10
P(g=empty | b=bad, f=empty) = 0.99

P(t=no | b=good) = 0.03
P(t=no | b=bad) = 0.98

P(s=no | t=yes, f=not empty) = 0.01
P(s=no | t=yes, f=empty) = 0.92
P(s=no | t=no, f=not empty) = 1.0
P(s=no | t=no, f=empty) = 1.0
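
The probability tables of the car-start example are enough to answer queries by brute-force enumeration, which may help show what "a calculus of uncertainty" means in practice. The sketch below assumes the factorization implied by the conditioning sets in those tables, P(b) P(f) P(g|b,f) P(t|b) P(s|t,f). The function names `pick`, `joint` and `probability` are hypothetical helpers introduced here, and exhaustive summation is only one way to do inference, practical because there are just 32 joint assignments.

```python
from itertools import product

# Conditional probability tables copied from the example.
P_B_BAD = 0.02
P_F_EMPTY = 0.05
P_G_EMPTY = {("good", "not empty"): 0.04, ("good", "empty"): 0.97,
             ("bad", "not empty"): 0.10, ("bad", "empty"): 0.99}
P_T_NO = {"good": 0.03, "bad": 0.98}
P_S_NO = {("yes", "not empty"): 0.01, ("yes", "empty"): 0.92,
          ("no", "not empty"): 1.0, ("no", "empty"): 1.0}

# Possible values of each variable (b, f, g, t, s as in the example).
VALUES = {
    "b": ("good", "bad"),
    "f": ("not empty", "empty"),
    "g": ("not empty", "empty"),
    "t": ("yes", "no"),
    "s": ("yes", "no"),
}


def pick(p_event, happened):
    """Probability of a binary outcome, given the probability of the event."""
    return p_event if happened else 1.0 - p_event


def joint(b, f, g, t, s):
    """Assumed factorization: P(b) P(f) P(g|b,f) P(t|b) P(s|t,f)."""
    return (pick(P_B_BAD, b == "bad")
            * pick(P_F_EMPTY, f == "empty")
            * pick(P_G_EMPTY[(b, f)], g == "empty")
            * pick(P_T_NO[b], t == "no")
            * pick(P_S_NO[(t, f)], s == "no"))


def probability(query, evidence=None):
    """P(query | evidence), summing the joint over all 32 assignments."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for combo in product(*VALUES.values()):
        assignment = dict(zip(VALUES, combo))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(**assignment)
        denominator += p
        if all(assignment[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_no_start = probability({"s": "no"})
p_empty_given_no_start = probability({"f": "empty"}, {"s": "no"})
print(f"P(s=no) = {p_no_start:.4f}")
print(f"P(f=empty | s=no) = {p_empty_given_no_start:.4f}")
```

With these tables the car fails to start about 10% of the time. Observing that failure raises the probability of an empty tank from the prior of 0.05 to roughly 0.45, which is the kind of diagnostic reasoning a probabilistic expert system performs.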

Hidden Markov Models
[Figure: chain q0 → q1 → q2 → q3 → ... → qT, with π0 attached to q0 and A labelling each transition; each state qt has an observation yt below it (y0, y1, y2, y3, ..., yT)]
• Models for sequences (biological, temporal)
• Examples in Scientific Data Mining
  – Hidden Markov Models for gene finding in DNA (profile HMMs)
  – Condition monitoring in antenna control systems for NASA (Smyth)

Data Mining: Relationships to Other Fields
• Statistics
• Machine Learning
• Database technology
• Visualization
• ...

Relationship of Machine Learning to Data Mining
• Machine Learning is concerned with making computers that learn things for themselves.
• Data mining is more concerned with enabling humans to learn from data.

Summary
• “Standard view” of data mining, including visualization, predictive and descriptive modelling
• Scientific data mining: how much do we need to go beyond the “standard view”?
• Belief networks as one way of modelling complex networks of non-deterministic relationships