#37- Clustering & Classification Algorithms
BCB 444/544 F07 ISU Dobbs 11/26/07

BCB 444/544
Lecture 37
Brief Review: Microarrays
Clustering & Classification Algorithms
#37_Nov26
Thanks to:
Doina Caragea, KSU
Dan Nettleton, ISU

Required Reading (before lecture)
Mon Nov 26 - Lecture 37
Clustering & Classification Algorithms
• Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38
Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12
R Statistical Computing & Graphics (Garrett Dancik)
http://www.r-project.org/
Fri Nov 30 - Lecture 39
Systems Biology
(& a bit of Metabolomics & Synthetic Biology)

Assignments & Announcements
Mon Nov 26 - HW#6 Due (sometime before 5 PM Mon Nov 26)
Mon Dec 3 - BCB 544 Project Reports Due (but no class!)
ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’)
           #2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’)
           #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)
Will include:
40 pts In Class: New material (since Exam 2)
20 pts In Class: Comprehensive
40 pts In Lab Practical (Comprehensive)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium
• Greg Voth, Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson, Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia, ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams, Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death

Chp 18 - Functional Genomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
High-throughput analysis of RNA expression:
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays

Transcriptome Analysis
Transcriptome = complete collection of all RNAs in a cell at a given time
Microarrays - "Gene Chips" most popular
Other related methods:
SAGE = Serial Analysis of Gene Expression
MPSS = Massively Parallel Signature Sequencing

Microarray Analysis
Which RNAs are detected?
• mRNAs (& pre-RNAs), alternatively spliced mRNAs
• rRNAs, tRNAs
• miRNAs, siRNAs, other regulatory RNAs
2 Major Types of DNA Microarrays:
cDNA = "spotted" = low density, glass slides
     = Southern blot on a slide
oligo = "DNA chip" = high density, photolithography
     "Affy" chip; computationally designed
Both types can be made here, in ISU facilities

"Guilt by Association" - Similar expression patterns suggest potential functions for novel proteins
TF is induced 2X & is known to activate genes G1 and G2, both of which are induced 6X.
G3 is induced 6X, too. Is it regulated by TF?
• Clustering of gene expression patterns (with known genes) suggests potential functions for unknown genes - additional experiments are required to test these hypothesized functions.
Copyright © 2006 A. Malcolm Campbell

Gene Expression Pattern Clusters: for several thousand genes!!
Each row represents a different gene
Each column represents a different time point
Green indicates repression (decrease in RNA)
Red indicates induction (increase in RNA)
Genes have been clustered so they are near other genes with similar expression patterns.
Notice that the genes at the bottom were repressed for the first few time points.
Copyright © 2006 A. Malcolm Campbell

ISU Microarray Researchers & Facilities
Microarray Facilities:
Center for Plant Genomics (ISU PSI) - Pat Schnable, in Carver Co-Lab
GeneChip Facility (ISU Biotech & PSI) - Steve Whitham, in MBB
Research Labs:
Pat Schnable (Agron/GDCB) - Facilities for cDNA microarrays
Steve Whitham (PlPath) - Facilities for oligo microarrays
Google "microarrays" from ISU website>>> Lots more:
Jo Anne Powell-Coffman, GDCB: genes induced under oxidative stress
Roger Wise, Rico Caldo, Plant Pathology: interaction between multiple isolates of powdery mildew and multiple genotypes of barley
Chris Tuggle, Animal Science: genes controlling mammalian embryo development

ISU Microarray Design & Analysis
• Dan Nettleton (Stat) - Experimental design & statistical analyses
• Hui-Hsien Chou (Com S) - "Picky" software for designing oligos
• Di Cook (Stat) - "exploRase" software for high-dimensional data analysis & visualization for systems biology

Gene Expression Analysis
• Experimental Design is critical
  ISU Course: Stat 416/516X Nettleton
  Statistical Design & Analysis of Microarray Experiments
• Tools from Statistics & Machine Learning are needed
  ISU Experts: Dan Nettleton & Di Cook, Stat
               Vasant Honavar, Com S
  Statistics:
    ANOVA (Analysis of Variance)
    R Statistics package
  ML: Clustering & Classification Algorithms
    WEKA package
  GEPAS
Many additional resources & tools available online
ISU has several Microarray Analysis Suites

Microarray Analysis - Questions:
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two clusters? (similarity criteria)
• What are “good clusters”?

Data Analysis Considerations
• Normalization
• Combining results from replicates
• Identifying differentially expressed genes
• Dealing with missing values
• Static vs. time series

Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Two Views of same Microarray Experiment
• Data points are genes
  • Represented by expression levels across different samples/experiments/conditions (ie, features=samples)
  • Goal: categorize genes
• Data points are samples (eg, patients)
  • Represented by expression levels of different genes (ie, features=genes)
  • Goal: categorize samples

Two Ways to View Microarray Data
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...

Data Points are Genes
[Same table; each gene (column) is treated as a data point]

Data Points are Samples
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...

Clustering: Unsupervised Learning Task 1
• Given: a set of microarray results in which gene expression levels are measured under different experimental conditions
• Do: Cluster the genes, where a gene is described by its expression levels under different conditions
• Outcome: Groups genes into clusters, where expression of all members of a cluster tends to go up or down together
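
The two views of the same experiment (data points are samples vs. data points are genes) differ only by a transpose of the expression matrix. A minimal Python sketch, using the four persons and three probe-set IDs from the table in these slides; the variable names are illustrative assumptions, not part of the lecture:

```python
# Rows are persons (samples), columns are genes, as in the slide table.
genes = ["A28202_ac", "AB00014_at", "AB00015_at"]
persons = ["Person 1", "Person 2", "Person 3", "Person 4"]
expression = [
    [1142.0, 321.0, 2567.2],
    [586.3, 586.1, 759.0],
    [105.2, 559.3, 3210.7],
    [42.8, 692.1, 812.0],
]

# View 1: data points are samples; the features of a sample are its gene values.
samples_view = dict(zip(persons, expression))

# View 2: data points are genes; transposing the matrix makes the features
# of a gene its values across samples.
genes_view = {gene: list(column) for gene, column in zip(genes, zip(*expression))}

print(samples_view["Person 3"])
print(genes_view["AB00014_at"])
```

Clustering the entries of `genes_view` corresponds to Task 1 (cluster the genes); clustering the entries of `samples_view` corresponds to clustering the experiments.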

Example: Groups of Genes are Clustered
[Figure: clustered expression matrix; rows = Genes, columns = Experiments (Samples)]
(Green = up-regulated, Red = down-regulated)

Visualizing Expression Patterns for Different Clusters
[Figure: Normalized expression vs. Time (10-minute intervals) for Gene Cluster 1, size=20, and Gene Cluster 2, size=43 (from Sharan & Shamir, 2000)]

Clustering: Unsupervised Learning Task 2
• Given: a set of microarray results in which experimental samples correspond to different patients
• Do: Cluster the experiments
• Outcome: Groups samples according to similarities in gene expression profiles

Examples
• Cluster samples from mice subjected to a variety of toxic compounds
• Cluster samples from cancer patients to discover different subtypes of a cancer
• Cluster samples taken at different timepoints

Supervision: Add Class Values
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1        1142.0      321.0        2567.2       ...   normal
Person 2        586.3       586.1        759.0        ...   cancer
Person 3        105.2       559.3        3210.7       ...   normal
Person 4        42.8        692.1        812.0        ...   cancer
...             ...         ...          ...          ...   ...

Classification: Supervised Learning Task
• Given: a set of microarray experiments, each done with mRNA from a different patient (but from same cell type from every patient)
  Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class
• Do: Learn a model that accurately predicts class based on features
• Outcome: Predict class value of a patient based on expression levels of his/her genes

Methods for Clustering
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
  • (in lab, won’t discuss in lecture)
• …many others….

Clustering Metrics
• A key issue in clustering is to determine what similarity / distance metric to use
• Often, such metric has a bigger effect on the results than actual clustering algorithm used!
• When determining the metric, we should take into account our assumptions about the data and the goal of the clustering

Distance Metrics for 2 n-Dimensional Vectors
(e.g., for a series of expression measurements)
• Euclidean distance
$$D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$
• Correlation coefficient
$$\rho(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}$$
where $\sigma_x = \sqrt{E(x^2) - E(x)^2}$ and E(x) is expected value of x
• Other metrics are also used…

Measuring Quality of Clusters
• Compare INTRA-cluster distances with INTER-cluster distances.
  Good clusters should have big difference
• Compare computed clusters with known clusters (if there are any) to see how closely they match
  Good clusters will contain all known and no “wrong” cluster members
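
To make the point about the choice of metric concrete, here is a minimal pure-Python sketch of the two distance metrics named on the slide. The function names and the two toy expression profiles are assumptions made for illustration; the standard deviation is computed as sqrt(E(x^2) - E(x)^2), matching the slide's definition.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Correlation coefficient: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum(a * a for a in x) / n - mu_x ** 2)
    sd_y = math.sqrt(sum(b * b for b in y) / n - mu_y ** 2)
    return cov / (sd_x * sd_y)

# Two made-up profiles with the same shape but a 10-fold difference in scale.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [10.0, 20.0, 30.0, 40.0]

print(euclidean(gene_1, gene_2))  # large: the profiles are far apart in magnitude
print(pearson(gene_1, gene_2))    # close to 1: the profiles rise together
```

Under Euclidean distance these two genes look very different; under correlation they look nearly identical. This is one way the metric can matter more than the clustering algorithm.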

INTRA- vs INTER-Cluster Distances
[Figure: one clustering labeled Good! and one labeled Bad!]

How Determine Distances?
Intra-cluster distance
• Min/Max/Avg the distance between
  - All pairs of points in the cluster OR
  - Between centroid and all points in the cluster
Inter-cluster distance
• Single link
  • distance between two most similar members
• Complete link
  • distance between two least similar members
• Average link
  • Average distance of all pairs
• Centroid distance
What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)

Similarity Criterion: Single Link
• Cluster similarity = similarity of two most similar members
Potentially long and skinny clusters

Similarity Criterion: Complete Link
• Cluster similarity = similarity of two least similar members
Tight clusters
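
The inter-cluster criteria listed above (single, complete and average link, plus centroid distance) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: the function names and the two toy clusters are assumptions, and Euclidean distance (`math.dist`, Python 3.8+) stands in for whatever metric is chosen.

```python
import math

def single_link(c1, c2):
    """Distance between the two closest (most similar) members."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Distance between the two farthest (least similar) members."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid(cluster):
    """Arithmetic mean of each coordinate of the points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_distance(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

# Two made-up clusters of 2-D points.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("centroid", centroid_distance)]:
    print(name, fn(cluster_a, cluster_b))
```

For any pair of clusters, single link gives the smallest of these pairwise-based values and complete link the largest, which is why single link tends to chain points into long clusters and complete link favors tight ones.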

Similarity Criterion: Average Link
• Cluster similarity = average similarity of all pairs
This is perhaps most widely used similarity criterion

Hierarchical Clustering*
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes merged into a single node
*This method was illustrated in Lecture 36, Tables 6.1-6.4
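
The agglomerative (bottom-up) procedure can be sketched as follows. This is a deliberately naive illustration under stated assumptions: the objects are 1-dimensional, numerical difference is the distance, and cluster-to-cluster distances are recomputed from scratch at every pass instead of being kept in an updated distance matrix (steps 1 and 4). The function name, the `linkage` parameter and the sample points are hypothetical.

```python
def agglomerative(points, linkage=min):
    """Merge the two closest clusters until one remains.

    linkage=min gives single link, linkage=max gives complete link.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]  # initially each item is a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters under the chosen linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        # Merge them and continue with the reduced set of clusters.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerative([1.0, 2.0, 4.5, 8.0, 10.0])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The merge history is the tree: cutting it at a chosen distance yields the clusters, which is why the number of clusters is left to the analyst.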

Hierarchical Clustering Example: Using Single Link Criterion to Iteratively “Combine” Data Points
[Figure: data points combined step by step under the single link criterion. Copyright: Russ Altman]

Hierarchical Clustering: Strengths & Weaknesses
• Easy to understand & implement
• Can decide how big to make clusters by choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima
Bottom-up is most commonly used method
• Can also perform top-down, which requires splitting a large group successively

K-Means Clustering (Model-based)
Computationally attractive!
1. Choose random points (cluster centers or centroids) in k dimensions
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much

K Means Clustering for k=2
A more realistic example
For K = 2
Two features: f1 (x-coordinate) & f2 (y-coordinate)
[Figure: scatter plot showing Initial Centroid A and Initial Centroid B moving to 2nd Centroid A and 2nd Centroid B]

K-Means Clustering Example, for k=2
For simplicity, assume k=2 & objects are 1-dimensional
(Numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids:
   (C1) = 8/3 = 2.7
   (C2) = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids:
   (C1) = 1.5
   (C2) = 6.0
6. No change? Converged!
=> Final clusters = {1,2} & {5,6,7}
[Figure: objects 1, 2, 5, 6, 7 on a number line: Pick seeds, Assign clusters, Compute centroids (2.7 & 6.5), Re-assign clusters, Compute centroids (1.5 & 6), Re-assign clusters, Converged!]
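
The 1-dimensional K-means example for k=2 (objects 1, 2, 5, 6, 7 with seeds 5 and 6) can be reproduced with a short Python sketch. The function name and `max_iter` parameter are assumptions for illustration. The stopping rule here is "centroids did not change at all", a stricter stand-in for "do not move very much".

```python
def kmeans_1d(points, seeds, max_iter=100):
    """K-means for 1-D objects, using numerical difference as distance."""
    centroids = list(seeds)
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the average of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change: converged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], seeds=[5, 6])
print(clusters)   # final clusters
print(centroids)  # final centroids
```

Running it with other seeds is a quick way to see the sensitivity to initial centroids that applies to K-means in general.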
From S. Mooney

K-Means Clustering: Strengths & Weaknesses
• Fast, O(N)
• Hard to know which K to choose
  • Try several and assess cluster quality
• Hard to know where to seed the clusters
  • Results can change drastically with different initial choices for centroids - as shown in example:

Example Illustrating Sensitivity to Seeds
[Figure: six points labeled A, B, C, D, E, F]
In the above, if start with B and E as centroids, will converge to {A,B,C} and {D,E,F}
If start with D and F, will converge to {A,B,D,E} {C,F}
Choice of K? Helpful to have additional information to aid evaluation of clusters

Hierarchical Clustering vs K-Means
                Hierarchical Clustering                  K-Means
Running Time    Slower                                   Faster
Assumptions     Requires distance metric                 Requires distance metric
Parameters      None                                     K (number of clusters)
Clusters        Subjective (only a tree is returned)     Exactly K clusters

Clustering vs Classification
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Compare in Graphical Representation
[Figure: two panels, Clustering and Classification]
Classification: Apply external labels: RED group & BLUE group

Tradeoffs
• Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters
• Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge

Methods for Classification
• K-nearest neighbors
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines

K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data points (e.g., for k = 4)
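
The KNN idea (label a new point by a vote among its k closest labeled neighbors) fits in a few lines of Python. This is a minimal sketch under assumptions: the function name, the made-up two-gene expression values and the labels are illustrative only, and Euclidean distance (`math.dist`, Python 3.8+) is used as the distance metric.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """Predict a label for `query` by majority vote among its k nearest neighbors.

    `training` is a list of (feature_vector, label) pairs.
    """
    # Compute the distance to all labeled data and sort by it.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Majority vote among the k closest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training data: (expression of two genes, class).
training = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.0), "cancer"),
    ((5.2, 4.9), "cancer"),
]

print(knn_predict(training, (1.1, 1.0), k=3))
print(knn_predict(training, (5.1, 5.0), k=3))
```

With two classes, an odd k avoids tied votes; with an even k, this sketch breaks ties by whichever tied label was seen first among the neighbors.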

Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point.

SLIDES FOLLOWING THIS ONE WERE NOT SHOWN IN LECTURE
• Some of this is material I discussed or wrote on blackboard
• It is provided here for your information & for future reference
• It will not be covered on the Final Exam!

Microarray Technology
Details re: 2 types of arrays
• cDNA “Slides”
• Short-oligonucleotide “Chips”
A few words about microarray terminology:
• Probes refers to cDNAs or DNA oligos attached to slide or chip
• Target refers to labeled mRNA or cRNA in solution, which is hybridized to probes attached to slide or chip
Note: this is opposite of terminology used in discussing Southern blots, etc, in which target is DNA attached to solid matrix & probe is labeled RNA or cDNA in solution, which is hybridized to targets attached to matrix

cDNA Microarrays
• Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples
• cDNAs are arrayed on each slide in a grid of spots.
• Each spot contains thousands of copies of a sequence that matches a segment of a gene’s coding sequence.
• A sequence and its complement are present in the same spot.
• Different spots typically represent different genes, but some genes may be represented by multiple spots
Dan Nettleton, ISU, Statistics 416/516X

cDNA Microarray Probes
• Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays.
• ESTs are small pieces of cDNA sequence (usually 200 to 500 nucleotides long) that have been reverse-transcribed from mRNA
[Figure: mRNA (...AAAAAAAAA...A), cDNA (TTTTTTTTTT...T) and the ESTs derived from it]

[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, each with a spot for gene 201 (GATATG...) and a spot for gene 576 (TTCCAG...)]
Each spot contains many copies of a sequence along with its complement (not shown).

Spotting cDNA Probes on Microarrays
• Solutions containing probes are transferred from a plate to a microarray slide by a robotic arrayer.
• The robot picks up a small amount of solution containing a probe by dipping a pin into a well on a plate.
• The robot then deposits a small drop of the solution on the microarray slide by touching the pin onto the slide.
• The pin is washed and the process is repeated for a different probe.
• Most arrayers use several pins so that multiple probes are spotted simultaneously on a slide.
• Most arrayers print multiple slides together so that probes are deposited on several slides prior to washing.

Spotting Probes on the Microarray
[Figure: 8 X 4 Print Head, plate with wells holding probes in solution, microarray slide]
All spots of the same color are made at the same time.
All spots in the same sector are made by the same pin.

cDNA Microarrays to Measure mRNA Levels
• RNA is extracted from a target sample of interest.
• mRNAs are reverse transcribed into cDNA.
• The resulting cDNAs are labeled with a fluorescent dye and are incubated with the microarray slide.
• Dyed cDNA sequences hybridize to complementary probes spotted on the array.
• A laser excites the dye and a scanner records an image of the slide.
• The image is quantified to obtain measures of fluorescence intensity for each pixel.
• Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.

cDNA Microarrays to Measure mRNA Levels (cont.)
• Usually two samples, dyed with different dyes, are hybridized to a single slide.
• The dyes fluoresce at different wavelengths so it is possible to get separate images for each dye.
• Images from the scanner are black and white, but it is typical to display Cy3 images as green and Cy5 images as red.
• It is common to superimpose the two images, using yellow to indicate a mixture of green and red.

Problems with cDNA Microarrays: Difficult to Make Meaningful Comparisons between Genes
• Measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence, etc.).
• Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.

cDNA Microarrays to Measure mRNA Levels:
Step 1: Prepare Microarray Slide & Sample mRNAs
[Figure: Microarray Slide with Spots (Probes) ACCTG...G, TTCTG...A, GGCTT...C, ATCTA...A, ACGGG...T, CGATA...G; Sample 1 and Sample 2 contain Unknown mRNA Sequences (Target), drawn as ???]

cDNA Microarrays to Measure mRNA Levels:
Step 2: Convert mRNA to cDNA & label with Fluorescent Dyes
[Figure: the same slide, with the Sample 1 and Sample 2 sequences converted to labeled cDNA]

cDNA Microarrays to Measure mRNA Levels:
Step 3: Mix Labeled cDNA and Hybridize to Slide
[Figure: labeled cDNA from Sample 1 and Sample 2 hybridized to the spots on the slide]

cDNA Microarrays to Measure mRNA Levels:
Step 5: Excite Dye with Laser, Scan & Quantify Signals
[Figure: quantified signal pairs for Sample 1 and Sample 2 at each spot: ACCTG...G 7652, 138; TTCTG...A 5708, 4388; GGCTT...C 8566, 765; ATCTA...A 1208, 13442; ACGGG...T 6784, 9762; CGATA...G 67, 239]

Pros/Cons of Spotted cDNA Arrays
• Many sources of variation in the manufacture of these arrays, print tips, lab, etc.
• Contamination
• Uneven distribution
• Flexible, can put any cDNA on slide

DNA Oligonucleotide “Chips”
• An oligonucleotide microarray is a microarray whose probes consist of synthetically created DNA oligonucleotides.
• Probe sequences are chosen to have good and relatively uniform hybridization characteristics
• A probe is chosen to match a portion of its target mRNA transcript that is unique to that sequence.
• Oligo probes can distinguish among multiple mRNA transcripts with similar sequences.
[Figures: www.affymetrix.com]

Simplified Example
[Figure: transcripts of gene 1 and gene 2. Shared green regions indicate high degree of sequence similarity throughout much of the transcript]
oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG

Oligo Microarray Fabrication
• Oligo sequences can be synthesized on a slide or chip using various commercial technologies.
• In one approach, sequences are synthesized on a slide using inkjet technology similar to that used in color printers. Separate cartridges for the four bases (A, C, G, T) are used to build nucleotides on a slide.
• Oligos can be synthesized and stored in solution for spotting as is done with cDNA microarrays.
• Affymetrix uses a photolithographic approach.

Affymetrix GeneChips
• Affymetrix (www.affymetrix.com) manufactures GeneChips, oligonucleotide arrays.
• Each gene (or sequence of interest or feature) is represented by multiple short (25-nucleotide) oligo probes.
• Some GeneChips include probes for around 60,000 genes.
• mRNA that has been extracted from a biological sample can be labeled (dyed) and hybridized to a GeneChip in a manner similar to that described for cDNA microarrays.
• Only one sample is hybridized to each GeneChip rather than two as in the case of cDNA microarrays.

Affymetrix Probe Sets
• A probe set is used to measure mRNA levels of a single gene.
• Each probe set consists of multiple probe cells.
• Each probe cell contains millions of copies of one oligo.
• Each oligo is intended to be 25 nucleotides in length.
• Probe cells in a probe set are arranged in probe pairs.
• Each probe pair contains a perfect match (PM) probe cell and a mismatch (MM) probe cell.
• A PM oligo perfectly matches part of a gene sequence.
• A MM oligo is identical to a PM oligo except that the middle nucleotide (13th of 25) is intentionally replaced by its complementary nucleotide.

Different Probe Pairs Represent Different Parts of the Same Gene
[Figure: Affymetrix GeneChips; mRNA reference sequence drawn 5’ to 3’ with spaced DNA probe pairs; a Probe Set is made of Probe Pairs, each with a PM Probe Cell and an MM Probe Cell]
gene sequence:       …TGTGATGGTGGGAATGGGTCAGAAGGGACTCCTATGTGGGTGACGAGGCC…
Perfect match oligo: TTACCCAGTCTTCCCTGAGGATACAC
Mismatch oligo:      TTACCCAGTCTTGCCTGAGGATACAC
PM - 25 bases complementary to gene
MM - Middle base is different
Probes are selected to be specific to the target gene and have good hybridization characteristics.
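
The rule for building a mismatch (MM) oligo from a perfect match (PM) oligo, replacing the middle nucleotide (13th of 25) by its complement, can be sketched in Python. The function name is an assumption for illustration; the 25-mer used as input is the "oligo probe for gene 1" from the Simplified Example slide.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM oligo: the PM oligo with its middle base complemented."""
    if len(pm) % 2 == 0:
        raise ValueError("expected an odd-length oligo with a single middle base")
    middle = len(pm) // 2  # index 12 (the 13th base) for a 25-mer
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "ATTACTAAGCATAGATTGCCGTATA"  # 25-mer from the Simplified Example slide
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Only position 13 differs between the two strings. The MM cell is meant to capture non-specific hybridization at the same location as the PM cell.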

Obtaining Labeled Target for Affy Chips
1. RNA -> single-stranded cDNA
2. Single-stranded cDNA -> double-stranded cDNA
3. Double-stranded cDNA -> labeled single-stranded cRNA complementary to coding sequence
Number of copies of each sequence gets amplified in conversion to cRNA.

Pros/Cons GeneChip Arrays
• Consistent manufacture -> good standardization
• Comparable across experiments
• Design is time-consuming, good for large sets of chips
• Can only see what is on the chip

Affymetrix Data Processing Pipeline
Experiment preparation: *.exp file
Image of the scanned probe array: *.dat file
Probe Cell Intensity file: *.cel file
Analysis output: *.chp file
(MicroArray Suite or other analysis software)

Reminder:
Why do microarray experiments?
• Compare two (or more) conditions to identify differentially expressed genes
  • Control/treatment
  • Disease/normal
• Exploratory analysis
  • What genes are expressed in response to drought stress?
  • What gene expression changes occur during normal retinal development?
• Diagnostic & prognostic tool development:
  • Can we predict certain conditions (breast cancer vs normal)?
  • Can we identify patterns of gene expression that predict a patient’s response to treatment/drug?

Differential Gene Expression
• Are there significant differences in expression level between the conditions?
• Analysis of Variance (ANOVA)
[Figure labels: Control, Inoculated, Mutant 1, Mutant 2]

Exploratory Analysis
• Find patterns in data to see what genes are expressed under different conditions
• Analysis includes clustering methods
• Used when little or no prior knowledge exists about the problem

Classification
• Learn characteristic patterns from a training set and evaluate with a test set.
• Classify tumor types based on expression patterns
• Predict disease susceptibility, stages, etc.

Microarray data analysis
Preprocessing
  normalization
  scatter plots
Inferential statistics
  t-test
  ANOVA
Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)

Pre-processing
Main goal of data preprocessing is to remove any systematic bias in the data as completely as possible, while preserving variation in gene expression that occurs because of biologically relevant changes in transcription.
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
(Page 191)

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample.
Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.
We use a test statistic to decide whether to reject the null hypothesis or fail to reject it. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
(Page 199)
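
As a hedged illustration of hypothesis testing on one gene, the sketch below runs an exact permutation test on the difference of group means. Note the swap: the slides name the t-test and ANOVA, but the Python standard library has no t-distribution, so a permutation test is used here as a stand-in. The function name and the intensity values are made up for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    Null hypothesis: group labels are exchangeable (no difference between groups).
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of relabeling the pooled values into two groups.
    for chosen in combinations(range(len(pooled)), len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen_set]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

# Made-up signal intensities for one gene in normal vs diseased samples.
normal = [1.0, 2.0, 3.0, 4.0]
diseased = [11.0, 12.0, 13.0, 14.0]

p = permutation_p_value(normal, diseased)
print(p, "reject null" if p < 0.05 else "fail to reject null")
```

With thousands of genes tested at once, a per-gene threshold of 0.05 would flag many genes by chance alone, which is one reason experimental design and statistical expertise are emphasized in this lecture.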

Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix.
Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
(Page 203)

Limitations of Microarrays
• Link between proteins and expressed RNA not always clear
• Difficult to compare between microarray platforms
• Only see what is on the microarray
  • Gene finding is still an art
  • Other coding regions, “dark matter” on genome
  • But now microarrays for these are being developed, too!