Exploring Multivariate Data
• Supervised Classification: trees, random forests, neural networks
• Clustering: k-means, model-based, self-organizing maps
Supervised Classification
• Build a model to predict the class (group) for future data
• Examples: spam filters, face recognition, ...
Preparation
• Many data mining methods do not have built-in cross-validation techniques
• Error on new data may be higher than the error on the training data
• Hold out a sample to use for assessing the model
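A hold-out split can be sketched as follows (a minimal sketch using scikit-learn; the data here is simulated, with the 217/106 train/test sizes taken from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(323, 8))      # 323 simulated cases, 8 variables
y = rng.integers(0, 4, size=323)   # simulated class labels (4 areas)

# Hold out roughly a third of the cases for assessing the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=106, random_state=1, stratify=y)
```

Stratifying keeps the class proportions similar in both halves, so the hold-out error estimate is not distorted by an unlucky split.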
Plot the Data

[Scatterplot matrix of the eight fatty acids: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic]

• No obvious clustering
• Some variables strongly associated, e.g. palmitic and palmitoleic
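A scatterplot matrix like the one on this slide can be drawn with pandas (a sketch on simulated data; the fatty-acid names come from the slides, the values do not):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
n = 200
# Simulated stand-in for three of the fatty-acid measurements
df = pd.DataFrame({"palmitic": rng.normal(1200, 150, n)})
df["palmitoleic"] = 0.1 * df["palmitic"] + rng.normal(0, 10, n)  # a strongly associated pair
df["stearic"] = rng.normal(230, 30, n)

# One panel per pair of variables, histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```

Strongly associated pairs, like palmitic and palmitoleic on the slide, show up as tight diagonal bands in their panel.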
Plot the Data

[Parallel coordinate plot of the eight fatty acids, each scaled to 0 to 1, colored by area: South Apulia, Sicily, North Apulia, Calabria]

• Areas S. Apulia, N. Apulia, Calabria are different from each other
• Area Sicily overlaps with all
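A parallel coordinate plot with each variable scaled to 0 to 1, as on this slide, can be sketched with pandas (simulated data; only the area and variable names are from the slides):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(3)
cols = ["palmitic", "oleic", "linoleic"]
df = pd.DataFrame(rng.normal(size=(60, 3)), columns=cols)
df["area"] = rng.choice(["South Apulia", "Sicily", "North Apulia"], size=60)

# Scale each variable to [0, 1] so all axes are comparable
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

# One polyline per case, colored by its class
ax = parallel_coordinates(df, class_column="area")
```

Groups that separate on some variable show up as bands of lines that stay apart on that axis; overlapping groups, like Sicily on the slide, weave through all the others.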
Trees

• Sequentially split the data to get subsets that have “pure” classes
• Training error: 14/217 = 0.065
• Test error: 18/106 = 0.170

[Tree diagram with splits linoleic < 950.5, palmitoleic >= 95.5, stearic >= 258, linolenic >= 37, oleic >= 7724; leaf classes: Sicily, Calabria, South Apulia, North Apulia, Sicily, South Apulia]
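Fitting a classification tree of this kind can be sketched with scikit-learn (simulated data; the 950.5 threshold on linoleic is borrowed from the slide's first split, everything else is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n = 300
# Two simulated fatty-acid variables; classes separated by a threshold on linoleic
linoleic = rng.uniform(600, 1400, n)
palmitoleic = rng.uniform(50, 250, n)
X = np.column_stack([linoleic, palmitoleic])
y = np.where(linoleic < 950.5, "South Apulia", "Sicily")

# The tree searches for axis-aligned splits that make the subsets "pure"
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
train_error = 1 - tree.score(X, y)
```

Because the simulated classes are perfectly separable by one threshold, the tree recovers it and the training error is zero; on real data, as the slide shows, the test error is usually noticeably higher than the training error.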
Random Forests

A random forest is a classifier built from multiple trees, generated by randomly sampling cases and variables. Forests are computationally intensive, but retain some of the interpretability of trees. Several parameters control the algorithm, and random forests output numerous diagnostics.

http://www.math.usu.edu/~adele/forests/index.htm is a good site for more information
Random Forests
• Inputs: number of variables, number of trees
• Diagnostics returned: error rate, variable importance, ...
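These inputs and diagnostics map directly onto scikit-learn's random forest (a sketch on simulated data; the parameter values are illustrative, not the ones behind the slide's results):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(217, 8))
# Simulated labels driven mainly by the first column,
# so that column should rank as most important
y = (X[:, 0] + 0.1 * rng.normal(size=217) > 0).astype(int)

# Inputs: number of trees (n_estimators), variables tried per split (max_features)
forest = RandomForestClassifier(
    n_estimators=200, max_features=3, random_state=0).fit(X, y)

# Diagnostic: relative importance of each variable
importances = forest.feature_importances_
```

On the olive oil data this importance ranking is what puts linoleic, palmitoleic, and oleic at the top on the next slide.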
Random Forests
• Training error: 11/217 = 0.051
• Test error: 13/106 = 0.123
• Importance of variables: linoleic, palmitoleic, oleic, ...
Feed-forward Neural Networks

[Scatterplot of palmitoleic against linoleic, colored by area: South Apulia, Sicily, North Apulia, Calabria]

Neural networks are loosely based on the neuron systems in organisms, where dendrites pass information along a network based on a chemical threshold. As the level of a chemical builds up in the neuron, it approaches a threshold at which it fires off the chemical signal to the next neuron.
Neural Networks

Feed-forward neural networks (FFNN) were developed from this concept, that combining small components is a way to build a model from predictors to response. They actually generalize linear regression functions. A simple network model as produced by nnet code in S-Plus may be represented by the equation:

$$\hat{y} = f(x) = \phi\Big(\alpha + \sum_{h=1}^{s} w_h\,\phi\Big(\alpha_h + \sum_{i=1}^{p} w_{ih} x_i\Big)\Big)$$

where x is the vector of explanatory variable values, y is the target value, p is the number of variables, s is the number of nodes in the single hidden layer, and φ is a fixed function, usually a linear or logistic function. This model has a single hidden layer, and univariate output values.

The response variable can be multivariate. A simple linear regression model, $y = w_0 + \sum_{j=1}^{p} w_j x_j$, can be represented as a feed-forward neural network.
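The single-hidden-layer equation above can be written out directly (a minimal numpy sketch with a logistic φ; the weight values are arbitrary illustrative numbers, not a fitted model):

```python
import numpy as np

def logistic(z):
    """Logistic function, one common choice for phi."""
    return 1.0 / (1.0 + np.exp(-z))

def ffnn(x, alpha, w, alpha_h, w_ih, phi=logistic):
    """y_hat = phi(alpha + sum_h w[h] * phi(alpha_h[h] + sum_i w_ih[i, h] * x[i]))"""
    hidden = phi(alpha_h + x @ w_ih)   # activations of the s hidden nodes
    return phi(alpha + hidden @ w)     # single (univariate) output

# p = 2 inputs, s = 3 hidden nodes, arbitrary weights
x = np.array([0.5, -1.0])
alpha_h = np.array([0.1, -0.2, 0.3])      # hidden-node intercepts
w_ih = np.array([[0.4, -0.5, 0.6],        # input-to-hidden weights, shape (p, s)
                 [0.7, 0.8, -0.9]])
w = np.array([1.0, -1.0, 0.5])            # hidden-to-output weights
alpha = 0.2                               # output intercept
y_hat = ffnn(x, alpha, w, alpha_h, w_ih)
```

With a logistic φ on the output the prediction always lies in (0, 1); taking φ to be the identity and s = 1 with unit weights recovers plain linear regression, which is the sense in which the FFNN generalizes it.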
Neural Networks
• Choose the number of nodes in the hidden layer, and the amount of smoothing of the boundary, ...
• Fit by minimizing a loss (or error) function
• Difficult to fit, and possible to overfit
• Depends on the initial random start
Neural Networks
• After several random starts, the best minimum value is 9.77 (save this model!)
• Training error: 2/217 = 0.009
• Test error: 12/106 = 0.113
Your turn
For the music data, using type as the response,
and variables lvar, lave, lmax, lfener, lfreq as
explanatory variables:
• Fit a random forest
• Fit a neural network
Report the training errors, and, for the forest,
which variables are the most important.
Also predict the 5 unlabeled tracks as
either Classical or Rock.
Cluster analysis
• Group cases together according to a
measure of similarity
• Examples: market segmentation, gene
function, ...
What are the clusters in these data?
Cluster analysis

• To define “similar”, we need a distance measure
• Euclidean distance between A and B:

$$d(A, B) = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2 + \cdots + (A_p - B_p)^2}$$
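The Euclidean distance formula translates directly into code (a minimal stdlib sketch):

```python
import math

def euclidean(a, b):
    """Euclidean distance between points a and b in p dimensions:
    sqrt((a1-b1)^2 + (a2-b2)^2 + ... + (ap-bp)^2)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# A classic 3-4-5 right triangle in 3D
d = euclidean([0, 0, 0], [3, 4, 0])
```

In practice the variables are often standardized first, so that one variable measured on a large scale does not dominate the distance.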
Model-based clustering
• Fit a mixture of Gaussians
• Covariance parameterizations: EII, VII, EEE, EEV, VVV
Plot the Data

[Scatterplot matrix of glucose, insulin, sspg]

• No obvious clusters
• A concentration of points, and two strings of points
Model-based Clustering

[Plot of BIC (about -5800 to -4800) against number of components (2 to 8) for covariance models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV]

• Highest BIC value corresponds to a model with 3 clusters and the unconstrained (VVV) variance model

Model-based Clustering

[Scatterplot matrix of glucose, insulin, sspg, colored by cluster]

• The three clusters correspond to one with low values on all variables (green), one high on glucose and insulin (blue), and one high on sspg (orange)
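The BIC-based model choice above can be sketched with scikit-learn's Gaussian mixtures (simulated three-cluster data standing in for glucose/insulin/sspg; note that sklearn's BIC is lower-is-better, the opposite sign convention to the mclust-style plot on the slide, and covariance_type="full" plays the role of the unconstrained VVV model):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Three well-separated simulated clusters in 3 variables
X = np.vstack([
    rng.normal([0, 0, 0], 0.5, size=(100, 3)),
    rng.normal([5, 5, 0], 0.5, size=(100, 3)),
    rng.normal([0, 0, 5], 0.5, size=(100, 3)),
])

# Fit mixtures with 1..6 components and compare by BIC
bics = {k: GaussianMixture(n_components=k, covariance_type="full",
                           random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)  # lower BIC is better in sklearn
```

As on the slide, the BIC criterion trades fit against the number of parameters, so it picks the three-component model rather than ever-larger mixtures.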
Self-organizing maps
• Fit k-means to the data
• Constrain the means to lie on a 2D grid
• Fitting can be tricky! Check the sum of squared differences between points and means after fitting.
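The steps above can be sketched as a minimal self-organizing map in numpy (this is an illustrative toy implementation, not the SOM software behind the slides; grid size, learning rate, and neighborhood width are arbitrary choices):

```python
import numpy as np

def fit_som(X, grid=(4, 4), epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Minimal SOM: k-means-like updates with the means tied to a 2D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.normal(size=(rows * cols, X.shape[1]))  # one mean per grid cell
    for _ in range(epochs):
        for x in X:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))     # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)  # grid distance to BMU
            h = np.exp(-d2 / (2 * sigma ** 2))              # neighborhood weights
            W += lr * h[:, None] * (x - W)                  # pull neighbors toward x
    return W

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3))
W = fit_som(X)
# The check recommended on the slide: sum of squared differences
# between points and their nearest means
sse = sum(((W - x) ** 2).sum(axis=1).min() for x in X)
```

The neighborhood weighting is what distinguishes this from plain k-means: updating a unit also drags its grid neighbors, so nearby cells end up representing similar regions of the data.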
Self-organizing Maps

[Map of the glucose, insulin, sspg data]

• Suggests 3 connected clusters: one with low values on all variables, one with high values on glucose and insulin, one with high values on sspg
Your turn
For the music data, use model-based
clustering and self-organizing maps to group
the music clips.