CLASSIFICATION DISCRIMINATION - Department of Mathematics

advertisement
CLASSIFICATION
DISCRIMINATION
LECTURE 15
What is Discrimination or Classification?
• Consider an example where we have two populations P1
and P2 each ~ N(m1,s1) and N(m2,s2) respectively.
• A new observation is observed and it is known to come
from either of these populations.
• The task of a discriminant function is to determine a “rule”
to decide from which of the two populations x is most
likely to come from.
• How we come up with a rule is what we need to study.
Supervised Learning
• In computer Science this is known as SUPERVISED
learning.
• Essentially we know the class labels ahead of time.
• What we need to do is find a RULE using features in the
data that DISCRIMINATES effectively between the
classes.
• So that if we have a new observation with its features we
can correctly classify it.
Example 1
• Suppose you are a doctor considering two different
anesthetics for a patient.
• You have some information about the patient, gender, age,
some medical history variables.
• So what we need is a data set where we have patient
information and whether or not the anesthetic was SAFE
for that patient.
• So what you want to do is USING the available variables
build a MODEL or RULE that says whether anesthetic A
or B is better for the patient.
• Then use this rule to decide whether or not to give the new
patient A or B.
Example 2: Turkey Thief
• There was this legal case in Kansas where a turkey farmer accused his
neighbor of stealing turkeys from the farm.
• When the neighbor was arrested and the police looked in the freezer,
there were multiple frozen turkeys there.
• The accused claimed these were WILD turkey that he had caught.
• The Statistician was called in to give evidence as there are some
biological differences between domestic and wild turkey.
• So the biologist measured the bones and other body characteristic of
the domestic and Wild turkeys and the Statistician built a
DISCRIMANT function.
• They used the classification function to see if the turkeys in the freezer
fell into he WILD or DOMESTIC class.
• THEY ALL fell in the DOMESTIC classification!
The Idea
• USING knowledge of the classes we build the
FUNCTION.
• We want to minimize misclassification error.
• Question: Should we use ALL the data to build the
MODEL, because then we really do not have a good way
to find out the misclassification probabilities.
• Generally: Training set and Testing sets are used.
Some common Statistical Rules
• Suppose we want to classify between two multivariate
normal distribution P1 with parameters m1 and S1 and P2
with parameters m2 and S2.
• Suppose a new observation vector x is known to come
from P1 or P2.
• There are various Statistical Rules allow us to PREDICT
which population x most likely came from.
1. Likelihood Rule
Choose P1
if L(x,m1,s1) > L(x,m2,s2)
else choose P2.
Here, x is the observation vector.
This is a mathematical rule and reasonable under
the assumption of normality.
2. Linear Discriminant Function (LDA)rule:
Choose P1
if b’x – k > 0
and P2 otherwise.
Here b= S-1(m1-m2) and k=1/2(m1-m2)S-1(m1+m2)
The function b’x is called the linear discriminant function.
This assumes equal covariance matrices S1=S2=S.
It’s a single linear function of x that summarizes all the
information in x.
3. Mahalanobis Distance Rule
Choose P1
if d1 < d2
where di = (x-mi)S-1(x-mi) for i=1,2.
The function di is a measure of how far away x is from mi
taking the Variance-Covariance into account.
This assumes equal covariance matrices S1=S2=S.
The Likelihood criterion under normality and equal
variance is equivalent to this Rule.
4. Posterior probability rule
Choose P1 if
P(P1|x)>P(P2|x)
where,
P(Pi|x) = exp[(-1/2)di]/{exp[(-1/2)d1] + exp[(-1/2)d2] }
• Also assumes equal variance.
• Not a true probability as (P1|x) is not a random event as
the observation belongs to either P1 or P2.
• Gives an idea of how confident we are in our effort to
discriminate.
Caveats
Generally mi and si are not known and we use
sample values.
Under equal covariance all 4 rules are equivalent
in terms of discrimination between groups.
Also in general we have more than 2 populations
to discriminate the observations into.
Sample Discriminant Rules
• Since we never know the parameters m1, m2, S1, S2. we
use sample estimates generally MLE estimates below and
form discrimant rules as in given before.
• x ,x ,S ,S
1 2 1
2
( N1 - 1) S1 + ( N 2 - 1) S 2
Pooled S =
N1 + N 2 - 2
Estimating Probability of Misclassification
• 1. Re-substitution Estimates:
Apply the discriminant function to the data used to
develop the rule and see how well it discriminates in
general.
USES the SAME data to make and validate models.
Holdout Data:
Keep a part of the data out from the part used to construct
the rule and use the rule on that part and see how well it
performs.
Problem is: if you don’t have a lot of samples its not the
most efficient use of resources for building the model.
Cross Validation:
Remove one observation at a time from the set, and
construct the rule from the remaining observations and
predict the first, do this for the second and third…
Define a summary matrix for misclassifying each data
point.
Also called Jack-knifing.
• Obviously a rule classifying correctly a HIGHER number
of times is preferred.
The Issue for MA
• Often it is known in advance WHERE the samples come
from and what conditions they have been exposed to.
• In fact we are often interested in gene expression profiles
to distinguish between different conditions or classes.
• In the past schemes like a voting scheme was used to look
at class membership in MAs.
• MANY MANY methods available, but general consensus
is that a few of the methods have robust performance e.g.
Linear discriminant Function (LDA), k-Nearest Neighbors
(k-NN).
Cost Function and Prior Probabilities
• When we there are only two populations all the four rules
discussed earlier have the property that probability of
misclassifying 1 to 2 is the same as 2 to 1.
• NOT generally a good idea especially in our anesthetic
example. Idea is if you are going have to err, err in the
side of caution.
• Hence we need to take into account the COST of
misclassification.
Some Math Details
•
•
•
•
•
Define U = b’x-k from LDA.
U=(m1-m2)’S-1x - .5 (m1-m2)’S-1 (m1+m2)
Under Normality and equal variance,
if x comes from P1, U ~ N(d,d)
and if x comes from P2, U ~ N(-d,d)
• Where d =(m1-m2)’S-1 (m1-m2)
• And our Rule for LDA is P1 if U > 0 and p2 otherwise.
• To make it asymmetric you can use a rule U > u where we can pick the
probability of misclassifying into one of the populations at most a
fixed number say alpha.
A General Rule
• Define Cost Function as C(i|j) the cost of misclassifying an observation
from Pj to Pi.
• Define Prior probability as pi for the ith group.
• Average Cost of Misclassification (two groups)
• p1C(2|1)P(2|1) + p2C(1|2)P(1|2)
• Bayes Rule: Choose P1
• if p2f(x;q2)C(1|2) > p1f(x;q1)C(2|1)
• Observe if p1=p2 and C(2|1)=C(1|2) this reduces to the Likelihood
rule.
• Under Normality and equal variance it reduces to:
• d1* < d2* where d1* = .5(x-m1)’S-1 (x-m1) – log(p1.C(2|1))
Probabilistic Classification Theory (PCT)
• Most classification methods can be described as special
implementations of Bayes’ Classifiers. The decision rule
for classifying x into one of the classes P1…,Pk depends
upon:
– Prior information about the class frequencies p1…pk.
– Information about how the class membership affects the gene
expression profiles xi (i=1…n)
– Misclassification costs C(j,i) of classifying an observation which
belongs to class Pi into Pj.
• Our aim is to find a classification rule R that minimizes the
expected Classification Costs.
PCT II: Bayes Rule
• Recall Cost of Misclassification is given by:
• C(j|i) = 0 if i=j
•
= Ci , if i j (generally Ci is set to 1)
• Result: the classification rule that minimizes the expected
misclassification cost is given by the posterior probability:
• R(x) = arg Min P(C|x) = arg Min P(x|C)pc
• This is called the Bayes Rule.
PCT III: Prior Information
• Hence the idea is: IF we know the Probability of Class
membership pc, and the conditional probability of the data
given the classifiers P(x|C), we can find the optimum
Classification Rule.
• In general it is VERY difficult to KNOW the prior
information about class membership.
• To find P(x|C) the Likelihood of the data, we often use the
Normal distribution (or log-transformed gene expression to
be Normal). This is done in the Training set.
Steps in Discriminant Analysis in MA
• Selection of features:
• Model Fitting
• Model Validation:
Selection of Features
Selecting a set of genes. We do not want all the genes since it may
have a tendency to over-fit the data also causes singularity.
How to select genes (gene filtering):
– Use ONLY differentially expressed genes using an ANOVA type
model: xi = a C(xi) + ei
– Look at multiple genes or gene groups. Do PCA on all the genes.
Not very efficient
– Partial least Squares(PLS), finds orthogonal linear combinations
that maximize Cov (Xl,y).
– Do PCA and then rank PCAs by ratio of between class to within
class varaince
– Other methods are Projection pursuit etc.
Most common differential expression or PLS
MODEL FITTING
• Commonly used:
• LDA
• K Nearest Neighbor
• Other related
• DLDA (Diagonal LDA)
• RDA (Regularized DA) (there is a R package for this)
• PAM (Prediction Analysis for Microarrays) (there is a R package
for this)
• FDA (Flexible DA)
Validation
• See how well the classifiers classify the observations into
the different classes.
• Mostly commonly used method leave-one-out-cross
validation.
• Though test data set (holdout sample) and resubmissions
are still used.
Linear Discriminant Analysis(LDA)
•
•
•
•
Easy useful method.
Been found to be robust in MA.
Idea:
The main assumption is that the class densities can be written as
Multivariate Normal.
• In R one uses lda in the MASS library.
• Hence,
– P(x| C=k) = MVN ( m1…mk, Skk)
– Maximize : P(C=k| x) ={ P(x| C=k)pk}/S(P(x|C=j)pj
– If feature set is known then it is fairly straight forward, else one has to use
some technique (forward, backward or step-wise) for feature selection.
K-nearest Neighbor (kNN)
• Assumption: samples with almost the same feature should belong to
the same class. In other words given a set of genes (g1,…,gm) known
to be important in class membership, the kNN classifier assigns an
unclassified sample to the class prevalent among the k samples whose
expression values for the m genes are closest in the sample of interest.
• Typically each profile for sample j, is compared to the other profiles
using Euclidean distances (however, any other distance like
Manhattan, Correlation can be useful as well).
• The aim of kNN is to estimate the posterior probability P(C(X)=j|X=x)
of a gene profile belonging to a class directly.
• For a particular k, it estimates the probability as a relative fraction of
samples that belong to class j, among the k samples with most similar
profiles.
• Essentially a non-linear classifier and may have VERY irregular edges.
lda example from R
•
•
•
•
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
+
Sp = rep(c("s","c","v"), rep(50,3)))
> train <- sample(1:150, 75)
> table(Iris$Sp[train])
•
•
•
•
•
c s v
27 24 24
> ## your answer may differ
> ## c s v
> ## 22 23 30
Running lda
•
•
•
•
•
> z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
> predict(z, Iris[-train, ])$class
[1] s s s s s s s s s s s s s s s s s s s s s s s s s s c c c c c c c c c c c c
[39] c c c c c c c c c c c v v v v c v v v v v v v v v v v c v v c v v v v v v
Levels: c s v
Contd…
•
•
•
•
> (z1 <- update(z, . ~ . - Petal.W.))
Call:
lda(Sp ~ Sepal.L. + Sepal.W. + Petal.L., data = Iris, prior = c(1,
1, 1)/3, subset = train)
•
•
•
Prior probabilities of groups:
c
s
v
0.3333333 0.3333333 0.3333333
Contd…
•
•
•
•
•
Group means:
Sepal.L. Sepal.W. Petal.L.
c 5.955556 2.781481 4.359259
s 5.008333 3.450000 1.429167
v 6.637500 2.983333 5.629167
•
•
•
•
•
Coefficients of linear discriminants:
LD1
LD2
Sepal.L. 0.9045765 -0.07677002
Sepal.W. 0.7347963 2.58009411
Petal.L. -3.1529282 0.37700694
•
•
•
Proportion of trace:
LD1 LD2
0.9939 0.0061
knn
•
•
•
•
•
library(class)
> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
> knn(train, test, cl, k = 3, prob=TRUE)
Download