Classification vs. Clustering

• Data Mining Lecture Notes
• Dr. Yousef Qawqzeh
Clustering and Classification
Concepts
Bayesian Classification
Clustering Algorithms
Applications
Classifying Mitochondrial Proteins
Clustering Gene Expression Data
Lab Practical
Clustering and Classification of Gene Expression Data
TB Expression Data Compendium (Boshoff et al)
• Objects characterized by one or more
features
• Classification
– Have labels for some points
– Want a “rule” that will accurately assign
labels to new points
– Supervised learning
The Basic Idea
[Figure: objects (proteins, genes) plotted as points in feature space, e.g. Feature X / Exp 1 vs. Feature Y / Exp 2, with features such as expression, co-expression, and conservation]
• Clustering
– No labels
– Group points into clusters based on how
“near” they are to one another
– Identify structure in data
– Unsupervised learning
Clustering and Classification in Genomics
• Classification
– Microarray data: classify cell state (e.g. AML vs. ALL) using expression data
– Protein/gene sequences: predict function, localization, etc.
• Clustering
– Microarray data: groups of genes that share similar function have similar expression patterns – identify regulons
– Protein sequence: group related proteins to infer function
– EST data: collapse redundant sequences
Classification
Two Different Approaches
• Generative
– Bayesian Classification and Naïve Bayes
– Example: Mitochondrial Protein Prediction
• Discriminative
– Support Vector Machines
Bayesian Classification
We will pose the classification problem in
probabilistic terms
Create models for how features are
distributed for objects of different classes
We will use probability calculus to make
classification decisions
Classifying Mitochondrial Proteins
Derive 7 features for all
human proteins
Targeting signal
Protein domains
Co-expression
Mass Spec
Homology
Induction
Motifs
Predict nuclear encoded
mitochondrial genes
Maestro
• Each object can
be associated
with multiple
features
• We will look at
the case of just
one feature for
now
Let's Look at Just One Feature
[Figure: proteins plotted along a single feature axis, e.g. co-expression or conservation]
We are going to define two key concepts….
The First Key Concept
Features for each class drawn from class-conditional
probability distributions (CCPD)
[Figure: the two class-conditional distributions P(X|Class1) and P(X|Class2) plotted over the feature X]
Our first goal will be to model these distributions
This Seems Familiar….
Modeling Pathogenicity Islands
P(Si|MP) is a class-conditional
probability distribution
Objects: Nucleotides
Class: Pathogenicity Island
Feature: A, T, G, C
P(Si|MP):  A: 0.15,  T: 0.13,  G: 0.30,  C: 0.42
Makes sense: we built a simple classifier for sequences (MP vs B)
The Second Key Concept
We model prior probabilities to quantify the expected a
priori chance of seeing a class
P(Class2) & P(Class1)
P(mito) = how likely is the next protein to be a mitochondrial protein before I
see any features to help me decide
We expect ~1500 mitochondrial genes out of ~21000 total, so
P(mito)=1500/21000
P(~mito)=19500/21000
But How Do We Classify?
• So we have priors defining the a priori probability
of a class
P(Class1), P(Class2)
• We also have models for the probability of a
feature given each class
P(X|Class1), P(X|Class2)
But we want the probability of the class given a feature
How do we get P(Class1|X)?
Bayes Rule
P(Class | Feature) = P(Feature | Class) × P(Class) / P(Feature)

Posterior (belief after evaluating the evidence) = Likelihood × Prior (belief before evidence) / Evidence
Bayes, Thomas (1763) An essay
towards solving a problem in the
doctrine of chances. Philosophical
Transactions of the Royal Society of
London, 53:370-418
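As a quick numeric illustration (not from the original slides), Bayes rule can be applied directly; the likelihood values below are made-up placeholders, while the priors reuse the ~1500/21000 mitochondrial estimate from earlier:

```python
# Minimal sketch of Bayes rule: posterior = likelihood * prior / evidence.
# The likelihood values here are hypothetical placeholders.

def posterior(likelihoods, priors):
    """likelihoods, priors: dicts keyed by class name."""
    evidence = sum(likelihoods[c] * priors[c] for c in priors)   # P(Feature)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}

priors = {"mito": 1500 / 21000, "not_mito": 19500 / 21000}
likelihoods = {"mito": 0.30, "not_mito": 0.05}   # hypothetical P(X | class)

print(posterior(likelihoods, priors))   # P(mito | X) is only ~0.32 despite the larger likelihood
```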
Bayes Decision Rule
If we observe an object with feature X, how do we decide if the object is
from Class 1?
The Bayes Decision Rule is simply choose Class1 if:
P(Class1 | X)  >  P(Class2 | X)

⇔   P(X | Class1) P(Class1) / P(X)   >   P(X | Class2) P(Class2) / P(X)

(P(X) is the same number on both sides, so it cancels)

⇔   P(X | Class1) P(Class1)   >   P(X | Class2) P(Class2)
Discriminant Function
We can create a convenient representation of the
Bayes Decision Rule
P(X | Class1) P(Class1)   >   P(X | Class2) P(Class2)

P(X | Class1) P(Class1) / [ P(X | Class2) P(Class2) ]   >   1

G(X) = log [ P(X | Class1) P(Class1) / ( P(X | Class2) P(Class2) ) ]   >   0
If G(X) > 0, we classify as Class 1
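A minimal sketch of this decision rule as code, assuming the class-conditional probabilities and priors are already available from some model (the numbers in the example are hypothetical):

```python
import math

def G(p_x_class1, p_x_class2, p_class1, p_class2):
    """G(X) = log[ P(X|Class1) P(Class1) / (P(X|Class2) P(Class2)) ]."""
    return math.log(p_x_class1 * p_class1) - math.log(p_x_class2 * p_class2)

# Classify as Class1 whenever G(...) > 0, e.g. with hypothetical values:
print(G(0.30, 0.05, 1500 / 21000, 19500 / 21000) > 0)   # False: the prior outweighs the likelihood
```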
Stepping back
What do we have so far?
We have defined the two components, class-conditional
distributions and priors
P(X|Class1), P(X|Class2)
P(Class1), P(Class2)
We have used Bayes Rule to create a discriminant function for
classification from these components
G(X) = log [ P(X | Class1) P(Class1) / ( P(X | Class2) P(Class2) ) ]   >   0
Given a new feature, X, we plug
it into this equation…
…and if G(X)> 0 we classify as Class1
Flashback to First Lecture
We defined a scoring function for classifying sequences
as pathogenicity islands or background DNA
Can we justify this choice in Bayesian terms?
Score = log [ P(S | MB) / P(S | B) ]
      = log [ P(Sequence | Pathogenicity Island) / P(Sequence | Background DNA) ]

Pathogenicity Islands:  A: 0.42,  T: 0.30,  G: 0.13,  C: 0.15
Background DNA:         A: 0.25,  T: 0.25,  G: 0.25,  C: 0.25
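A small sketch of this score, assuming independent positions and using the per-base frequencies above; the test sequence is made up:

```python
import math

# Per-nucleotide frequencies from the slide.
island = {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}       # P(base | MB)
background = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}   # P(base | B)

def score(sequence):
    """log10 P(S|MB)/P(S|B), assuming independent positions."""
    return sum(math.log10(island[b] / background[b]) for b in sequence)

print(score("ATATTAAT"))   # AT-rich: positive score favours the island model
```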
From the Bayesian Approach
We can write out our Bayes discriminant function:
G(S) = log [ P(MB | S) / P(B | S) ]
     = log [ P(S | MB) P(MB) / ( P(S | B) P(B) ) ]        (Bayes Rule)
     = log [ P(S | MB) / P(S | B) ]                        if P(MB)/P(B) = 1, i.e. P(MB) = P(B)

The log-likelihood ratio is the Bayes classifier assuming equal priors.
Unequal Priors
Bayes rule lets us correct for the fact that islands are
not as common as background DNA
G(S) = log [ P(S | MB) P(MB) / ( P(S | B) P(B) ) ]
     = log [ ( P(S | MB) / P(S | B) ) × ( 0.15 / 0.85 ) ]
     = Score_original - 0.75
We require a higher score to call a pathogenicity island
(The HMM implicitly accounted for this in the transition probabilities)
Pathogenicity Islands (prior 15%):  A: 0.42,  T: 0.30,  G: 0.13,  C: 0.15
Background DNA (prior 85%):         A: 0.25,  T: 0.25,  G: 0.25,  C: 0.25
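A small sketch of this prior correction, in base-10 logs to match the -0.75 on the slide; the example scores are made up:

```python
import math

# Prior correction for unequal class frequencies (15% island, 85% background).
log_prior_ratio = math.log10(0.15 / 0.85)   # about -0.75, as on the slide

def bayes_decision(log_likelihood_ratio_score):
    """Call an island only if the prior-corrected score is positive."""
    return log_likelihood_ratio_score + log_prior_ratio > 0

print(bayes_decision(0.5))   # False: 0.5 does not clear the raised threshold of ~0.75
print(bayes_decision(1.2))   # True
```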
Back to Classification
We have two fundamental tasks
• We need to estimate the needed probability
distributions
– P(X|Mito) and P(X|~Mito)
– P(Mito) and P(~Mito)
• We need to assess the accuracy of the
classifier
– How well does it classify new objects?
The All Important Training Set
Building a classifier requires a set of labeled data points called
the Training Set
The quality of the classifier depends on the number of training
set data points
How many data points you need depends on the problem
You need labeled data both to build and to test your classifier
Getting P(X|Class) from Training Set
One Simple Approach

How do we get P(X|Class1) from the training data?

• Divide the X values into bins (e.g. <1, 1-3, 3-5, 5-7, >7) and simply count frequencies
• Here there are 13 data points, giving bin frequencies of 7/13, 3/13, 2/13, 1/13, and 0 across the five bins
• Should feel familiar from last lecture (i.e. Maximum Likelihood)

In general, and especially for continuous distributions, this can be a complicated problem: Density Estimation

[Figure: histogram estimate of P(X|Class1) over the binned feature X]
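A minimal sketch of this binning approach, with made-up training values and the same five bins:

```python
# Hypothetical 1-D feature values for the Class1 training points (13 of them).
x_class1 = [0.5, 1.2, 1.8, 2.0, 2.4, 2.6, 2.9, 3.1, 4.0, 4.4, 5.5, 6.1, 8.2]

bins = {"<1": (float("-inf"), 1), "1-3": (1, 3), "3-5": (3, 5),
        "5-7": (5, 7), ">7": (7, float("inf"))}

# Count how many training points fall in each bin, then normalise (maximum likelihood).
counts = {name: sum(lo <= x < hi for x in x_class1) for name, (lo, hi) in bins.items()}
p_x_given_class1 = {name: c / len(x_class1) for name, c in counts.items()}

print(p_x_given_class1)   # e.g. {'<1': 1/13, '1-3': 6/13, ...} for these made-up values
```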
Getting Priors
Three general approaches:

1. Estimate priors by counting the fraction of each class in the training set
   e.g. 13 Class1 and 10 Class2 points give P(Class1) = 13/23, P(Class2) = 10/23
   But sometimes the fractions in the training set are not representative of the world

2. Estimate from "expert" knowledge
   Example: P(mito) = 1500/21000, P(~mito) = 19500/21000

3. We have no idea – use equal (uninformative) priors
   P(Class1) = P(Class2)
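A small sketch of approaches 1 and 2, reusing the 13/10 counts and the mitochondrial estimate above:

```python
# Approach 1: estimate priors by counting class labels in the training set.
labels = ["Class1"] * 13 + ["Class2"] * 10        # the 13 + 10 example above

p_class1 = labels.count("Class1") / len(labels)   # 13/23
p_class2 = labels.count("Class2") / len(labels)   # 10/23

# Approach 2: override with external knowledge when the training fractions
# are not representative of the real world (the mito example).
p_mito, p_not_mito = 1500 / 21000, 19500 / 21000
```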
We Are Just About There….
We have created the class-conditional distributions and priors
P(X|Class1), P(X|Class2)
P(Class1), P(Class2)
And we are ready to plug these into our discriminant function
G(X) = log [ P(X | Class1) P(Class1) / ( P(X | Class2) P(Class2) ) ]   >   0
But there is one more little complication…..
But What About Multiple Features?
• We have focused on a single
feature for an object
• But mitochondrial protein
prediction (for example) has
7 features
Targeting signal
Protein domains
Co-expression
Mass Spec
Homology
Induction
Motifs
So P(X|Class) becomes P(X1, X2, ..., X7 | Class)
and our discriminant function becomes

G(X) = log [ P(X1, X2, ..., X7 | Class1) P(Class1) / ( P(X1, X2, ..., X7 | Class2) P(Class2) ) ]   >   0
Distributions Over Many Features
Estimating P(X1, X2, ..., X7 | Class1) can be difficult
• Assume each feature is binned into 5 possible values
• We then have 5^7 (about 78,000) combinations of values whose frequencies we need to count
• Generally will not have enough data
– We will have lots of nasty zeros
Naïve Bayes Classifier
We are going to make the following assumption:
All features are independent given the class
P(X1, X2, ..., Xn | Class) = P(X1 | Class) P(X2 | Class) ... P(Xn | Class)
                           = ∏_i P(Xi | Class)      (product over i = 1, ..., n)
We can thus estimate individual distributions for each
feature and just multiply them together!
Naïve Bayes Discriminant Function
Thus, with the Naïve Bayes assumption, we can now rewrite this:

G(X1, ..., X7) = log [ P(X1, X2, ..., X7 | Class1) P(Class1) / ( P(X1, X2, ..., X7 | Class2) P(Class2) ) ]   >   0

As this:

G(X1, ..., X7) = log [ ∏_i P(Xi | Class1) P(Class1) / ( ∏_i P(Xi | Class2) P(Class2) ) ]   >   0
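A minimal sketch of the Naïve Bayes discriminant, assuming each feature has already been binned and its class-conditional probabilities stored as lookup tables (all names and numbers here are illustrative, not Maestro's actual code):

```python
import math

def naive_bayes_G(x, ccpd1, ccpd2, p_class1, p_class2):
    """G(X1..Xn) = log[ prod_i P(Xi|Class1) P(Class1) / (prod_i P(Xi|Class2) P(Class2)) ].

    x     : the object's (binned) feature values
    ccpdK : one dict per feature mapping a binned value to P(Xi | ClassK)
    """
    g = math.log(p_class1) - math.log(p_class2)
    for xi, p1, p2 in zip(x, ccpd1, ccpd2):
        g += math.log(p1[xi]) - math.log(p2[xi])
    return g

# Toy example with two binned features; classify as Class1 when G > 0.
ccpd1 = [{"low": 0.1, "high": 0.9}, {"low": 0.3, "high": 0.7}]
ccpd2 = [{"low": 0.8, "high": 0.2}, {"low": 0.6, "high": 0.4}]
print(naive_bayes_G(["high", "high"], ccpd1, ccpd2, 0.5, 0.5) > 0)   # True
```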
Individual Feature Distributions
Instead of a single big distribution, we have a smaller
one for each feature (and class)
Targeting signal:   P(Target|Mito),   P(Target|~Mito)
Protein domains:    P(Domain|Mito),   P(Domain|~Mito)
Co-expression:      P(CE|Mito),       P(CE|~Mito)
Mass Spec:          P(Mass|Mito),     P(Mass|~Mito)
Homology:           P(Homology|Mito), P(Homology|~Mito)
Induction:          P(Induc|Mito),    P(Induc|~Mito)
Motifs:             P(Motif|Mito),    P(Motif|~Mito)

[Figure: example binned distribution for one feature, with bins <1, 1-3, 3-5, 5-7, >7 and frequencies 7/13, 3/13, 2/13, 1/13, 0]
Classifying A New Protein
Targeting signal
Protein domains
Co-expression
Mass Spec
Homology
Induction
Motifs

For each feature value Xi of the new protein, look up P(Xi|Mito) and P(Xi|~Mito) (for all 7 features)
Plug these and priors into the discriminant function
G(X1, ..., X7) = log [ ∏_i P(Xi | Mito) P(Mito) / ( ∏_i P(Xi | ~Mito) P(~Mito) ) ]   >   0

If G > 0, we predict that the protein is from class Mito
Maestro Results
Apply Maestro to Human Proteome
Total predictions: 1,451 proteins
490 novel predictions
Slide Credit: S. Calvo
How Good is the Classifier?
The Rule
We must test our classifier on a different
set from the training set: the labeled test
set
The Task
We will classify each object in the test set
and count the number of each type of
error
Binary Classification Errors
                   Predicted True    Predicted False
True (Mito)        TP                FN
False (~Mito)      FP                TN
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
• Sensitivity
– Fraction of all Class1 (True) that we correctly predicted as Class 1
– How good are we at finding what we are looking for
• Specificity
– Fraction of all Class 2 (False) called Class 2
– How many of the Class 2 do we filter out of our Class 1 predictions
In both cases, the higher the better
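A small sketch of these two measures computed from predicted and true labels; the test-set labels below are made up:

```python
def sensitivity_specificity(y_true, y_pred):
    """y_true / y_pred: lists of booleans (True = Mito)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical test-set results:
sens, spec = sensitivity_specificity(
    [True, True, True, False, False, False],
    [True, True, False, False, False, True],
)
print(sens, spec)   # 2/3 sensitivity, 2/3 specificity for these toy labels
```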
Maestro Outperforms Existing Classifiers
Naïve Bayes
(Maestro)
(99%, 71%)
Slide Credit: S. Calvo
Clustering Expression Data
• Cluster Genes
– Group by similar expression in different conditions
• Cluster Experiments
– Group by similar expression profiles

[Figure: genes plotted as points in experiment space (Experiment 1 vs. Experiment 2); experiments plotted as points in gene space (Gene 1 vs. Gene 2)]
Why Cluster Genes by Expression?
• Organize data
– Explore without getting lost in each data point
– Enhance visualization
• Co-regulated Genes
– Common expression may imply common regulation
– Predict cis-regulatory promoter sequences
• Functional Annotation
– Similar function from similar expression

[Figure: GCN4 and the amino-acid genes His2 and His3, plus an unknown gene with a similar expression pattern]
Clustering Methods
There are many clustering methods; we will look briefly at two:
1. K-means – a partitioning method
2. Hierarchical clustering – an agglomerative method
K-Means Clustering
The Basic Idea
• Assume a fixed number of clusters, K
• Goal: create “compact” clusters
K-Means Algorithm
• Randomly initialize clusters
• Assign data points to nearest clusters
• Recalculate clusters
• Repeat… until convergence
Slide credit: Serafim Batzoglou
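A minimal sketch of the algorithm just described, written from scratch on toy 2-D data (purely illustrative; not the lab's Matlab implementation):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means: random init, assign to nearest centre, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest cluster centre.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centre as the mean of its assigned points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers

# Toy 2-D expression-like data with two obvious groups.
data = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                 [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centers = kmeans(data, k=2)
print(labels)
```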
(Dis)Similarity Measures
D’haeseleer (2005) Nat Biotech
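The slide points to D'haeseleer (2005) for a survey of distance measures; as a brief illustration (not from the slides), two common choices for expression profiles are Euclidean distance and correlation-based distance:

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

def correlation_distance(x, y):
    """1 - Pearson correlation: small when profiles rise and fall together,
    even if their absolute expression levels differ."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]      # same shape, very different magnitude
print(euclidean(a, b))            # large
print(correlation_distance(a, b)) # ~0: identical profile shape
```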
But How many clusters?
• How do we select K?
– We can always make clusters "more compact" by increasing K
– e.g. What happens if K = number of data points?
– What is a meaningful improvement?
• Hierarchical clustering side-steps this
issue
Hierarchical clustering
Most widely used algorithm for expression data
• Start with each point in a separate cluster
• At each step:
– Choose the pair of closest clusters
– Merge
Phylogeny reconstruction (UPGMA) uses the same agglomerative idea

[Figure: points a through h merged step by step into a dendrogram]
slide credits: M. Kellis
Visualization of results

Hierarchical clustering produces clusters for all possible numbers of clusters, from 1 to the number of data points

We can always select a “cut level” to create disjoint clusters

But how do we define distances between clusters?

[Figure: the points a through h and the corresponding dendrogram]
slide credits: M. Kellis
Distance between clusters
• Single-link method:    CD(X,Y) = min over x∈X, y∈Y of D(x,y)
• Complete-link method:  CD(X,Y) = max over x∈X, y∈Y of D(x,y)
• Average-link method:   CD(X,Y) = avg over x∈X, y∈Y of D(x,y)
• Centroid method:       CD(X,Y) = D( avg(X), avg(Y) )

[Figure: the points d through h illustrating each linkage method]
slide credits: M. Kellis
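A small sketch showing how these four cluster-distance definitions map onto an off-the-shelf hierarchical clustering routine (scipy here, purely for illustration; the lab itself uses Matlab):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression profiles (rows = genes, columns = experiments).
data = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [2.5, 2.6]])

# The four cluster-distance definitions from the slide correspond to
# scipy's linkage "method" argument.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(data, method=method)                  # full merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # "cut level" giving 2 clusters
    print(method, labels)
```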
What does clustered data look like?
Evaluating Cluster Performance
In general, it depends on your goals in clustering
• Robustness
– Select random samples from data set and cluster
– Repeat
– Robust clusters show up across the repeated clusterings
• Category Enrichment
– Look for categories of genes “over-represented” in
particular clusters
Next Lecture
– Also used in Motif Discovery
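A common way to test whether a category is over-represented in a cluster is a hypergeometric test; a small sketch with made-up counts (not the Tavazoie & Church numbers):

```python
from scipy.stats import hypergeom

# Hypothetical numbers: out of M genes total, n carry the annotation of interest;
# a cluster of N genes contains k of them.
M, n, N, k = 6000, 120, 50, 8

# P(at least k annotated genes in the cluster by chance): the enrichment p-value.
p_value = hypergeom.sf(k - 1, M, n, N)
print(p_value)
```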
Clusters and Motif Discovery
Expression from
15 time points
during yeast
cell cycle
Tavazoie & Church (1999)
Bringing Clustering and Classification Together
Semi-Supervised Learning
Common Scenario
• Few labeled
• Many unlabeled
• Structured data
What if we cluster first?
Then clusters can help
us classify
Tomorrow’s Lab
• Expression Clustering and Classification
– SVM
– Cross-Validation
– Matlab
• TB Expression Compendium