#38- Proteomics 11/28/07 Proteomics BCB 444/544

BCB 444/544
Lecture 38
Review: Microarrays
Proteomics
#38_Nov28
Thanks to
Doina Caragea, KSU

Required Reading
(before lecture)
√ Mon Nov 26 - Lecture 37
Clustering & Classification Algorithms
• Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38
Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12
R Statistical Computing & Graphics (Garrett Dancik)
http://www.r-project.org/
Fri Nov 30 - Lecture 39 (Last Lecture!)
Systems Biology
(& a bit of Metabolomics & Synthetic Biology)
BCB 444/544 F07 ISU Dobbs #38 - Proteomics
Assignments & Announcements
Mon Nov 26 - HW#6 Due (5 PM Mon Nov 26 or ASAP)
Mon Dec 3 - BCB 544 Project Reports Due (NO CLASS that day!!)
ALL BCB 444 & 544 students are REQUIRED to attend
ALL project presentations next week!!!
Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’), #2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’), #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)
Will include:
40 pts In Class: New material (since Exam 2)
20 pts In Class: Comprehensive
40 pts In Lab Practical (Comprehensive)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium
• Greg Voth Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon
partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death

Chp 18 – Functional Genomics
SECTION V
GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays

Gene Expression Analysis
Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
• Uses primary data to group measurements, with no
information from other sources
• Classification (supervised learning)
• Uses known groups of interest (from other sources) to learn
features associated with these groups in primary data and
create rules for associating data with groups of interest

Microarray Analysis - Questions & Answers
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two
clusters? (similarity criteria)
• Single link
• Complete link
• Average link
• What are “good clusters”?
• Big difference between INTRA-cluster distance and INTER-cluster
distance, i.e., INTRA-cluster distance is minimized while
INTER-cluster distance is maximized
• What are pros & cons of:
• Hierarchical vs K-means clustering
• Clustering vs Classification

Clustering Metrics
• A key issue in clustering is to determine what
similarity / distance metric to use
• Often, the metric has a bigger effect on the
results than the actual clustering algorithm used!
• When determining the metric, we should take into
account our assumptions about the data and the
goal of the clustering

How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg of the distance between
- All pairs of points in the cluster OR
- Centroid and all points in the cluster
Inter-cluster distance
• Single link
• distance between two most similar members
• Complete link
• distance between two least similar members
• Average link
• Average distance of all pairs
• Centroid distance
What is the centroid? the "average" of all points of X. The
centroid of a finite set of points can be computed as the arithmetic
mean of each coordinate of the points. Wikipedia
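
As an illustration of the inter-cluster criteria listed above, here is a minimal Python sketch. It assumes Euclidean distance and two small made-up clusters (`cluster_a`, `cluster_b`); the function names are hypothetical and not from the lecture.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def centroid(points):
    """Arithmetic mean of each coordinate of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def single_link(a, b):
    """Distance between the two most similar (closest) members."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Distance between the two least similar (farthest) members."""
    return max(dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))

# Made-up example clusters (assumption, for illustration only)
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("centroid", centroid_distance)]:
    print(f"{name:9s} {fn(cluster_a, cluster_b):.3f}")
```

For any pair of clusters, single link is never larger than average link, which is never larger than complete link, so the choice of criterion changes which clusters look "closest".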

Methods for Clustering
(Unsupervised Learning)
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
• (in lab, won’t discuss in lecture)
• …many others….

INTRA- vs INTER-Cluster Distances
[Figure: example clusterings labeled "Good!" and "Bad!"]

Hierarchical Clustering*
*This method was illustrated in Lecture 36,Tables 6.1-MM6.4
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes merged into a single node
Copyright: Russ Altman
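
The agglomerative (bottom-up) steps can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes Euclidean distance and average linkage, and for brevity it recomputes cluster distances in every round instead of updating a stored distance matrix. The names `agglomerate` and `average_link` are hypothetical.

```python
from math import dist

def average_link(a, b):
    """Average distance over all pairs drawn from clusters a and b."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, linkage=average_link):
    """Return the merge history as (cluster_1, cluster_2, distance) tuples."""
    clusters = [[p] for p in points]          # step 0: each item is a cluster
    merges = []
    while len(clusters) > 1:                  # step 5: repeat until one node
        best = None
        # steps 1-2 (and 4): evaluate all cluster pairs, keep the closest
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]    # step 3: merge them
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up 1-dimensional objects (assumption, for illustration only)
history = agglomerate([(1,), (2,), (5,), (6,), (7,)])
for left, right, d in history:
    print(left, "+", right, "at distance", d)
```

The merge history is the tree: cutting it at a chosen distance yields the clusters at that level.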

Hierarchical Clustering:
Strengths & Weaknesses
Computationally attractive!
Bottom-up is most commonly used method
• Can also perform top-down, which requires
splitting a large group successively
• Easy to understand & implement
• Can decide how big to make clusters by
choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima

K-Means Clustering (Model-based)
1. Choose random points (cluster centers or centroids) in k dimensions
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much

K Means Clustering for k=2
A more realistic example
[Figure: K = 2, two features f1 (x-coordinate) & f2 (y-coordinate); initial centroids A and B, then 2nd centroids A and B]

K-Means Clustering Example, for k=2
[Figure: objects 1, 2, 5, 6, 7 on a number line: pick seeds, assign clusters, compute centroids, re-assign clusters, converged]
For simplicity, assume k=2 & objects are 1-dimensional
(Numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids &
assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids:
(C1 ) = 8/3 = 2.7
(C2 ) = 13/2= 6.5
4. Calculate distance from points to new centroids &
assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids:
(C1 ) = 1.5
(C2 ) = 6.0
6. No change? Converged!
=> Final clusters = {1,2} & {5,6,7}
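
The worked example above can be reproduced with a short Python sketch. It is a minimal illustration, assuming 1-dimensional objects, absolute difference as the distance, and ties going to the lower-numbered centroid; the function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(points, seeds, max_iter=100):
    """K-means on 1-D data; returns (clusters, centroids)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
            for x in points
        ]
        if new_assignment == assignment:      # no change: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points
        for c in range(len(centroids)):
            members = [x for x, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    clusters = [[x for x, a in zip(points, assignment) if a == c]
                for c in range(len(centroids))]
    return clusters, centroids

clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], seeds=[5, 6])
print(clusters)   # [[1, 2], [5, 6, 7]]
print(centroids)  # [1.5, 6.0]
```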
From S. Mooney

K-Means Clustering:
Strengths & Weaknesses
• Fast, O(N)
• Hard to know which K to choose
• Try several and assess cluster quality
• Hard to know where to seed the clusters
• Results can change drastically with different initial
choices for centroids - as shown in example:
Choice of K? Helpful to have additional
information to aid evaluation of clusters

Example Illustrating
Sensitivity to Seeds
[Figure: six points labeled A to F]
In the above, if start
with B and E as centroids
will converge to {A,B,C}
and {D,E,F}
If start with D and F
Will converge to
{A,B,D,E} {C,F}
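
The seed sensitivity described above can be checked with a small Python sketch. The slide's figure is not reproduced here, so the coordinates in `POINTS` are an assumption: a hypothetical 2 x 3 layout chosen so that the two seedings behave as the slide states. The function name `kmeans_names` is also hypothetical.

```python
from math import dist

# Hypothetical coordinates (assumption): A B C on the top row, D E F below
POINTS = {"A": (0, 1), "B": (2, 1), "C": (5, 1),
          "D": (0, 0), "E": (2, 0), "F": (5, 0)}

def kmeans_names(points, seed_names, max_iter=100):
    """Run k-means from the named seed points; return clusters as name strings."""
    centroids = [points[name] for name in seed_names]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid (Euclidean distance)
        new_assignment = {
            name: min(range(len(centroids)), key=lambda c: dist(xy, centroids[c]))
            for name, xy in points.items()
        }
        if new_assignment == assignment:      # converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points
        for c in range(len(centroids)):
            members = [points[n] for n, a in assignment.items() if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    groups = [[] for _ in centroids]
    for name, c in assignment.items():
        groups[c].append(name)
    return sorted("".join(sorted(g)) for g in groups)

print(kmeans_names(POINTS, ["B", "E"]))  # ['ABC', 'DEF']
print(kmeans_names(POINTS, ["D", "F"]))  # ['ABDE', 'CF']
```

Both runs converge, but to different final clusters, which is why several seedings are usually tried and compared.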

Hierarchical Clustering vs K-Means

                Hierarchical Clustering               K-Means
Running Time    Slower                                Faster
Assumptions     Requires distance metric              Requires distance metric
Parameters      None                                  K (number of clusters)
Clusters        Subjective (only a tree is returned)  Exactly K clusters

Clustering vs Classification
• Clustering (unsupervised learning)
• Uses primary data to group measurements, with no
information from other sources
• Classification (supervised learning)
• Uses known groups of interest (from other sources) to learn
features associated with these groups in primary data and
create rules for associating data with groups of interest

Classification:
Supervised Learning Task
• Given: a set of microarray experiments, each done
with mRNA from a different patient (but from same
cell type from every patient)
Patient’s expression values for each gene constitute
the features, and patient’s disease constitutes the
class
• Do: Learn a model that accurately predicts class
based on features
• Outcome: Predict class value of a patient based on
expression levels of his/her genes

Methods for Classification
• K-nearest neighbors (KNN)
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines

K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data
points (e.g., for k = 4)

Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to
all labeled data
2. Sort distances, determine closest K neighbors
(smallest distances)
3. Use majority voting to predict label of unlabeled
data point
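
The three steps above map directly onto a short Python sketch. It assumes Euclidean distance as the metric and a tiny made-up training set with RED and BLUE labels; the name `knn_predict` is hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest labeled points.

    `train` is a list of (features, label) pairs.
    """
    # Steps 1-2: compute distances to all labeled data, keep the k closest
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Step 3: majority vote (on a tied vote, the label seen first wins,
    # i.e. the label of the nearer neighbor)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up training data (assumption, for illustration only)
train = [((1.0, 1.0), "RED"), ((1.5, 2.0), "RED"), ((2.0, 1.0), "RED"),
         ((6.0, 5.0), "BLUE"), ((7.0, 6.0), "BLUE"), ((6.5, 7.0), "BLUE")]

print(knn_predict(train, (2.0, 2.0), k=3))  # RED
print(knn_predict(train, (6.0, 6.0), k=3))  # BLUE
```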

Variations on KNN
• Can classify into multiple classes easily
• Weighted KNN - can weight votes of nearby
training samples based on their distance from
unknown sample
• Can set a threshold, p, for the # of votes needed
to win. (If no winner, then either NULL result or
set default winner)

Compare in Graphical Representation
Clustering
Classification
Apply external labels:
RED group & BLUE group
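
The weighted-vote and threshold variations on KNN could be sketched as follows. This is one possible reading, not the lecture's definition: votes are weighted by inverse distance, and the threshold `p` is interpreted as the minimum fraction of the total weighted vote needed to win. The name `weighted_knn`, the `eps` guard against division by zero, and the sample data are all assumptions.

```python
from collections import defaultdict
from math import dist

def weighted_knn(train, query, k, p=0.0, default=None, eps=1e-9):
    """Distance-weighted KNN vote with an optional winning threshold p."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        # Closer training samples get larger votes (inverse distance)
        votes[label] += 1.0 / (dist(features, query) + eps)
    winner = max(votes, key=votes.get)
    share = votes[winner] / sum(votes.values())
    # No label reaches the threshold: return the NULL/default result
    return winner if share >= p else default

# Made-up data: one RED sample close by, two BLUE samples farther away
train = [((0.0, 0.0), "RED"), ((3.0, 0.0), "BLUE"), ((3.1, 0.0), "BLUE")]

print(weighted_knn(train, (0.5, 0.0), k=3))          # RED (outweighs 2 BLUE)
print(weighted_knn(train, (0.5, 0.0), k=3, p=0.9))   # None (share is ~0.72)
```

With plain majority voting the same query would be labeled BLUE, so weighting can change the outcome when the nearest sample is much closer than the rest.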

Tradeoffs for Clustering vs Classification
• Clustering is not biased by previous knowledge, but
therefore needs stronger signal to discover
clusters
• Classification uses previous knowledge, so can
detect weaker signal, but may be biased by
WRONG previous knowledge

Chp 19 – Proteomics
SECTION V
GENOMICS & PROTEOMICS
Xiong: Chp 19 Proteomics
• Technology of Protein Expression Analysis
• Post-translational Modification
• Protein Sorting
• Protein-Protein Interactions

Proteomics: What do all those proteins do??
[Figure: Biological processes for yeast proteins]
Copyright © 2006
A. Malcolm Campbell

ISU Proteomics Resources & Researchers
Facilities:
Proteomics Facility (Carver Co-lab)
http://www.plantgenomics.iastate.edu/proteomics/
Protein Facility (MBB)
http://www.protein.iastate.edu/
Experiments:
Plant: Rodermel, Wise, Voytas
Animal: Greenlee, perhaps others soon?
Computational Analysis:
Honavar, Wise, Dobbs

Proteome Analysis: “Traditionally”
using Two-dimensional (2D) gels
1st D: Isoelectric focusing (IEF) in pH gradient:
Proteins migrate to isoelectric points & stop moving
2nd D: SDS-PAGE (SDS detergent, polyacrylamide gel electrophoresis):
Proteins migrate according to molecular weight
Copyright © 2006
A. Malcolm Campbell

Proteins identified on 2D gels
(IEF/SDS-PAGE)
Direct protein microsequencing by Edman degradation
-- done at facilities (here at ISU)
-- typically need 5 picomoles
-- often get 10 to 20 amino acids of sequence
Protein mass analysis by MALDI-TOF
-- Matrix-Assisted Laser Desorption/Ionization
Time-Of-Flight Mass Spectrometry
-- done at facilities (here at ISU)
-- often detect post-translational modifications
(such as phosphorylated Ser, Thr, Tyr)

Evaluation of 2D gels (IEF/SDS-PAGE)
Advantages:
Visualize hundreds to thousands of proteins
Improved identification of protein spots
Disadvantages:
Limited number of samples can be processed
Mostly abundant proteins visualized
Technically difficult
Page 250-1
Jonathan Pevsner
Page 251

Tandem Mass Spectrometry (MS/MS)
to Identify Proteins
Figure 8.19 Tandem mass spectrometry for
protein identification
a) ESI creates ionized proteins, represented by
colored shapes with positive charges. Each shape
represents many copies of identical proteins.
b) Ionized proteins are separated based on their
mass to charge ratio (m/z) and sent one at a time
into the activation chamber. Separation and
selection take place in the first of the two MS
devices. The solid purple protein has been selected
for analysis; the other three are temporarily stored
for later analysis.
c) The group of m/z selected ionized proteins enters a
collision cell that is filled with inert argon gas. Gas
molecules collide with proteins, which causes them
to break into two peptide pieces (labeled b and y).
d) Ionized peptide pieces are sent into second MS
device, which again measures the m/z ratio. A
computer compares spectrum of peptide pieces
to a database of ideal spectra to identify the
original group of identical proteins.
Copyright © 2006
A. Malcolm Campbell

Databases of 2D Gel Information
http://ca.expasy.org/ch2d/2d-index.html

MS data: Protein identification through
peptide fragment identification & separation
Figure 8.20 When a group of identical proteins is broken into peptide pieces, more than one pair of b and y
peptides will be formed. a) One protein sequence and its calculated mass on top, with the b peptides/masses
(gray) and the y peptides/masses (purple) below. b) An experimentally determined mass/charge spectrum from
the peptide in panel a). Some peaks are higher than others, which means that some b/y peptide pieces were
more abundant than others. The spectrum is used to determine each peptide’s amino acid sequence and
protein identity.
Copyright © 2006
A. Malcolm Campbell
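
To make the b and y peptide masses in the figure caption concrete, here is a minimal Python sketch that lists the theoretical b and y fragment m/z values for a peptide. It assumes singly charged ions and standard monoisotopic residue masses; the function name `fragment_ions` and the example peptide are hypothetical and not taken from the figure.

```python
# Monoisotopic residue masses in daltons (standard values)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # charge carrier for a singly charged ion

def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of (fragment_sequence, m/z) pairs.

    Breaking the backbone after residue i gives an N-terminal b fragment
    and the complementary C-terminal y fragment.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append((peptide[:i], sum(masses[:i]) + PROTON))
        y_ions.append((peptide[i:], sum(masses[i:]) + WATER + PROTON))
    return b_ions, y_ions

# Hypothetical example peptide
b_ions, y_ions = fragment_ions("SAMPLER")
for (b_seq, b_mz), (y_seq, y_mz) in zip(b_ions, y_ions):
    print(f"b {b_seq:7s} {b_mz:9.4f}   y {y_seq:7s} {y_mz:9.4f}")
```

Database search software compares a list like this, computed for each candidate peptide, against the observed peaks; each complementary b/y pair sums to the intact peptide mass plus two protons.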
Jonathan Pevsner