BCB 444/544
Lecture 37
Brief Review: Microarrays
Clustering & Classification
Algorithms
#37_Nov26
Thanks to:
Doina Caragea, KSU
Dan Nettleton, ISU
BCB 444/544 F07 ISU Dobbs #37- Clustering
11/26/07
1
Required Reading
(before lecture)
Mon Nov 26 - Lecture 37
Clustering & Classification Algorithms
• Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38
Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12
R Statistical Computing & Graphics (Garrett Dancik)
http://www.r-project.org/
Fri Nov 30 - Lecture 39
Systems Biology
(& a bit of Metabolomics & Synthetic Biology)
Assignments & Announcements
Mon Nov 26 - HW#6 Due (sometime before 5 PM Mon Nov 26)
Mon Dec 3 - BCB 544 Project Reports Due (but no class!)
ALL BCB 444 & 544 students are REQUIRED to attend
ALL project presentations next week!!!
Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’)
           #2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’)
           #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)
Will include:
40 pts In Class: New material (since Exam 2)
20 pts In Class: Comprehensive
40 pts In Lab Practical (Comprehensive)
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium,
• Greg Voth Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon
partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death
Chp 18 – Functional Genomics
SECTION V
GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays
Transcriptome Analysis
Transcriptome = complete collection of all RNAs in a
cell at a given time
High-throughput analysis of RNA expression:
Microarrays - "Gene Chips" most popular
Other related methods:
SAGE = Serial Analysis of Gene Expression
MPSS = Massively Parallel Signature Sequencing
Microarray Analysis
Which RNAs are detected?
• mRNAs (& pre-RNAs)
alternatively spliced mRNAs
• rRNAs, tRNAs
• miRNAs, siRNAs, other regulatory RNAs
2 Major Types of DNA Microarrays:
cDNA = "spotted" = low density, glass slides
= Southern blot on a slide
oligo = "DNA chip" = high density, photolithography
"Affy" chip; computationally designed
• Both types can be made here, in ISU facilities
"Guilt by Association" - Similar expression
patterns suggest potential functions for
novel proteins
TF is induced 2X & is known to activate genes
G1 and G2, both of which are induced 6X.
G3 is induced 6X, too. Is it regulated by TF?
Clustering of gene expression patterns (with known genes) suggests
potential functions for unknown genes - additional experiments are
required to test these hypothesized functions.
Copyright © 2006
A. Malcolm Campbell
Gene Expression Pattern Clusters:
for several thousand genes!!
Each row represents a different gene
Each column represents a different time point
Green indicates repression (decrease in RNA)
Red indicates induction (increase in RNA)
Genes have been clustered so they are near
other genes with similar expression patterns.
Notice that the genes at the bottom were
repressed for the first few time points.
Copyright © 2006
A. Malcolm Campbell
ISU Microarray Researchers & Facilities
Microarray Facilities:
Center for Plant Genomics (ISU PSI) - Pat Schnable
in Carver Co-Lab
GeneChip Facility (ISU Biotech & PSI) - Steve Whitham
in MBB
Research Labs:
Pat Schnable (Agron/GDCB) - Facilities for cDNA microarrays
Steve Whitham (PlPath) - Facilities for oligo microarrays
Google "microarrays" from ISU website>>> Lots more:
Jo Anne Powell-Coffman, GDCB: genes induced under oxidative stress
Roger Wise, Rico Caldo, Plant Pathology: interaction between multiple
isolates of powdery mildew and multiple genotypes of barley
Chris Tuggle, Animal Science: genes controlling mammalian embryo
development
ISU Microarray Design & Analysis
• Experimental Design is critical
ISU Course: Stat 416/516X Nettleton
Statistical Design & Analysis of Microarray Experiments
•Dan Nettleton (Stat) - Experimental design & statistical analyses
•Hui-Hsien Chou (Com S) - "Picky" software for designing oligos
•Di Cook (Stat) "exploRase" software for high-dimensional data
analysis & visualization for systems biology
•Tools from Statistics & Machine Learning are needed
ISU Experts: Dan Nettleton & Di Cook, Stat
Vasant Honavar, Com S
Statistics:
ANOVA (Analysis of Variance)
R Statistics package
ML: Clustering & Classification Algorithms
WEKA package
GEPAS
Many additional resources & tools available online
ISU has several Microarray Analysis Suites
Gene Expression Analysis
Doina Caragea
Microarray Analysis - Questions:
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two
clusters? (similarity criteria)
• What are “good clusters”?
Doina Caragea
Data Analysis Considerations
• Normalization
• Combining results from replicates
• Identifying differentially expressed genes
• Dealing with missing values
• Static vs. time series
Doina Caragea
Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
• Uses primary data to group measurements, with no information
from other sources
• Classification (supervised learning)
• Uses known groups of interest (from other sources) to learn
features associated with these groups in primary data and
create rules for associating data with groups of interest
Doina Caragea
Two Views of same Microarray Experiment
• Data points are genes
• Represented by expression levels across different
samples/experiments/conditions (ie, features=samples)
• Goal: categorize genes
• Data points are samples (eg, patients)
• Represented by expression levels of different genes
(ie, features=genes)
• Goal: categorize samples
Doina Caragea
Two Ways to View Microarray Data
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1           1142.0        321.0       2567.2   ...
Person 2            586.3        586.1        759.0   ...
Person 3            105.2        559.3       3210.7   ...
Person 4             42.8        692.1        812.0   ...
...                   ...          ...          ...   ...
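The two views amount to reading the same matrix by rows or by columns. A minimal sketch, using the values from the table on this slide; the variable names are illustrative only:

```python
# Expression values copied from the slide's table (rows = persons, columns = genes).
genes = ["A28202_ac", "AB00014_at", "AB00015_at"]
persons = ["Person 1", "Person 2", "Person 3", "Person 4"]
data = [
    [1142.0, 321.0, 2567.2],
    [586.3, 586.1, 759.0],
    [105.2, 559.3, 3210.7],
    [42.8, 692.1, 812.0],
]

# View 1: data points are samples; each is described by its gene values.
sample_view = dict(zip(persons, data))

# View 2: data points are genes; each is described by its values across samples.
gene_view = dict(zip(genes, (list(col) for col in zip(*data))))

print(sample_view["Person 2"])    # features = genes
print(gene_view["AB00014_at"])    # features = samples
```

Clustering the entries of `gene_view` categorizes genes; clustering the entries of `sample_view` categorizes samples.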
Data Points are Genes
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1           1142.0        321.0       2567.2   ...
Person 2            586.3        586.1        759.0   ...
Person 3            105.2        559.3       3210.7   ...
Person 4             42.8        692.1        812.0   ...
...                   ...          ...          ...   ...
Data Points are Samples
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1           1142.0        321.0       2567.2   ...
Person 2            586.3        586.1        759.0   ...
Person 3            105.2        559.3       3210.7   ...
Person 4             42.8        692.1        812.0   ...
...                   ...          ...          ...   ...
Clustering: Unsupervised Learning Task 1
• Given: a set of microarray results in which gene
expression levels are measured under different
experimental conditions
• Do: Cluster the genes, where a gene is described by
its expression levels under different conditions
• Outcome: Groups genes into clusters, where
expression of all members of a cluster tend to go
up or down together
Doina Caragea
Example: Groups of Genes are Clustered
Genes
(Green = up-regulated, Red = down-regulated)
Experiments (Samples)
Doina Caragea
Visualizing Expression Patterns for
Different Clusters
Gene Cluster 2, size=43
Normalized
expression
Gene Cluster 1, size=20
Time (10-minute intervals)
(from Sharan & Shamir, 2000)
Doina Caragea
Clustering: Unsupervised Learning Task 2
• Given: a set of microarray results in which
experimental samples correspond to different
patients
• Do: Cluster the experiments
• Outcome: Groups samples according to similarities
in gene expression profiles
Doina Caragea
Examples
• Cluster samples from mice subjected to a variety
of toxic compounds
• Cluster samples from cancer patients to discover
different subtypes of a cancer
• Cluster samples taken at different timepoints
Doina Caragea
Supervision: Add Class Values
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1           1142.0        321.0       2567.2   ...   normal
Person 2            586.3        586.1        759.0   ...   cancer
Person 3            105.2        559.3       3210.7   ...   normal
Person 4             42.8        692.1        812.0   ...   cancer
...                   ...          ...          ...   ...   ...
Classification:
Supervised Learning Task
• Given: a set of microarray experiments, each done
with mRNA from a different patient (but from same
cell type from every patient)
Patient’s expression values for each gene constitute
the features, and patient’s disease constitutes the
class
• Do: Learn a model that accurately predicts class
based on features
• Outcome: Predict class value of a patient based on
expression levels of his/her genes
Doina Caragea
Methods for Clustering
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
• (in lab, won’t discuss in lecture)
• …many others….
Doina Caragea
Clustering Metrics
• A key issue in clustering is to determine what
similarity / distance metric to use
• Often, such metric has a bigger effect on the
results than actual clustering algorithm used!
• When determining the metric, we should take into
account our assumptions about the data and the
goal of the clustering
Doina Caragea
Distance Metrics for 2 n-Dimensional Vectors
(e.g., for a series of expression measurements)
• Euclidean distance
\[ D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} \]
• Correlation coefficient
\[ \rho(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)} = \frac{\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \]
where \( \sigma_x = \sqrt{E(x^2) - E(x)^2} \) and E(x) is the expected value of x
• Other metrics are also used…
Doina Caragea
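The two metrics on this slide can be sketched in a few lines. This is only an illustration: the profiles `g1` and `g2` are made up, and using `1 - correlation` as a distance is one common convention, not something the slide specifies.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y)), population form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical profiles: g2 has the same shape as g1 at 10x the scale.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(euclidean(g1, g2))      # large, because the magnitudes differ
print(1 - pearson(g1, g2))    # close to 0, because the shapes match
```

The example shows why the choice of metric matters: the same pair of genes is far apart under Euclidean distance and nearly identical under correlation.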
Measuring Quality of Clusters
• Compare INTRA-cluster distances with INTER-cluster
distances.
Good clusters should have big difference
• Compare computed clusters with known clusters (if there
are any) to see how closely they match
Good clusters will contain all known and no “wrong”
cluster members
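One way to make the first comparison concrete is to compute average intra-cluster and inter-cluster distances and look at their ratio. This is a rough sketch with made-up 2-D points; the ratio used here is an illustrative choice, not a standard named index.

```python
import math
from itertools import combinations

def mean_intra(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter(c1, c2):
    """Average distance over all pairs with one point in each cluster."""
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

# Hypothetical 2-D points forming two well-separated groups.
a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

ratio = mean_inter(a, b) / max(mean_intra(a), mean_intra(b))
print(ratio)  # well above 1 when clusters are tight and far apart
```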
Doina Caragea
INTRA- vs INTER-Cluster Distances
[Figure: two example clusterings, one labeled "Good!" and one labeled "Bad!"]
Doina Caragea
How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg the distance between
- All pairs of points in the cluster OR
- Centroid and all points in the cluster
Inter-cluster distance
• Single link
• distance between two most similar members
• Complete link
• distance between two least similar members
• Average link
• Average distance of all pairs
• Centroid distance
What is the centroid? The "average" of all points of X. The
centroid of a finite set of points can be computed as the arithmetic
mean of each coordinate of the points. (Wikipedia)
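The four inter-cluster criteria can be written side by side. A minimal sketch, assuming Euclidean distance between points; the two clusters are made-up 1-D values stored as 1-tuples.

```python
import math

def pair_distances(c1, c2):
    """All distances between a point of c1 and a point of c2."""
    return [math.dist(p, q) for p in c1 for q in c2]

def single_link(c1, c2):
    """Distance between the two closest (most similar) members."""
    return min(pair_distances(c1, c2))

def complete_link(c1, c2):
    """Distance between the two farthest (least similar) members."""
    return max(pair_distances(c1, c2))

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    d = pair_distances(c1, c2)
    return sum(d) / len(d)

def centroid(cluster):
    """Arithmetic mean of each coordinate."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_distance(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

# Hypothetical 1-D clusters written as 1-tuples.
c1 = [(1.0,), (2.0,)]
c2 = [(5.0,), (9.0,)]
print(single_link(c1, c2), complete_link(c1, c2),
      average_link(c1, c2), centroid_distance(c1, c2))
```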
Doina Caragea
Similarity Criterion: Single Link
• Cluster similarity = similarity of two most similar
members
Potentially long
and skinny clusters
Doina Caragea
Similarity Criterion: Complete Link
• Cluster similarity = similarity of two least similar
members
Tight clusters
Doina Caragea
Similarity Criterion: Average Link
• Cluster similarity = average similarity of all pairs
This is perhaps most
widely used similarity
criterion
Doina Caragea
Hierarchical Clustering*
*This method was illustrated in Lecture 36,Tables 6.1-MM6.4
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar
clusters)
3. Merge them
4. Compute distances from merged node to all
others
5. Repeat until all nodes merged into a single
node
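The agglomerative steps above can be sketched for 1-D values with the single-link criterion. This is a simple, unoptimized illustration: it recomputes distances on every pass instead of updating a distance matrix, and `n_clusters` is a hypothetical parameter for stopping before everything is merged.

```python
def agglomerative(points, n_clusters=1):
    """Single-link agglomerative clustering of 1-D values.

    Returns the final clusters and the merge history.
    """
    clusters = [[p] for p in points]           # step 0: each item is a cluster
    history = []
    while len(clusters) > n_clusters:
        best = None
        # steps 1-2: find the two closest clusters (single link)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]      # step 3: merge them
        history.append((clusters[i], clusters[j], d))
        # step 4: distances to the merged node are recomputed on the next pass
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, history                    # step 5: loop ends at n_clusters

clusters, history = agglomerative([1, 2, 5, 6, 7], n_clusters=2)
print(clusters)
for left, right, d in history:
    print(left, "+", right, "merged at distance", d)
```

The merge history is the information a dendrogram displays: which nodes were joined, and at what distance.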
Doina Caragea
Hierarchical Clustering Example:
Using Single Link Criterion to Iteratively
“Combine” Data Points
Doina Caragea
Copyright: Russ Altman
Hierarchical Clustering:
Strengths & Weaknesses
• Easy to understand & implement
• Can decide how big to make clusters by choosing
cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima
Bottom-up is most commonly used method
• Can also perform top-down, which requires
splitting a large group successively
Doina Caragea
K-Means Clustering (Model-based)
Computationally attractive!
1. Choose K random points (cluster centers, or centroids) in the feature space
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much
[Figure: for K = 2 and two features, f1 (x-coordinate) & f2 (y-coordinate), showing initial centroids A and B and the 2nd centroids A and B after one update]
Doina Caragea
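The loop described on this slide can be sketched as follows. The function name, the `init`, `tol` and `seed` parameters, and the data are all illustrative assumptions; choosing initial centers by sampling from the data is one common option, not the only one.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Plain k-means on tuples of floats; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # step 1: initial centers, either supplied or sampled from the data
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                        # steps 2-3: assign to closest
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [                       # step 4: recompute centroids
            tuple(sum(c) / len(members) for c in zip(*members)) if members
            else centroids[i]                   # an empty cluster keeps its center
            for i, members in enumerate(clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:                        # step 5: stop when centers settle
            break
    return centroids, clusters

# Hypothetical data with two features (f1, f2) and two obvious groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)
```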
K-Means Clustering Example, for k=2
For simplicity, assume k=2 & objects are 1-dimensional
(Numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids &
assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids:
(C1) = 8/3 = 2.7
(C2) = 13/2= 6.5
4. Calculate distance from points to new centroids &
assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids:
(C1) = 1.5
(C2) = 6.0
6. No change? Converged!
=> Final clusters = {1,2} & {5,6,7}
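The worked example can be replayed in code to check each step. A small sketch for 1-D values only; the function name is illustrative, and it assumes no cluster ever becomes empty (true for this example).

```python
def kmeans_1d_trace(values, centers):
    """Run 1-D k-means from given centers, recording every pass."""
    trace = []
    while True:
        clusters = [[] for _ in centers]
        for v in values:
            # assign to the nearest center (numerical difference as distance)
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        new_centers = [sum(c) / len(c) for c in clusters]
        trace.append((clusters, new_centers))
        if new_centers == centers:       # no change: converged
            return trace
        centers = new_centers

for clusters, centers in kmeans_1d_trace([1, 2, 5, 6, 7], [5, 6]):
    print(clusters, [round(c, 1) for c in centers])
```

The first pass gives {1,2,5} and {6,7}, the second moves 5 into the other cluster, and the third pass changes nothing, which matches steps 2 through 6.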
Doina Caragea
K Means Clustering for k=2
A more realistic example
Pick seeds
Assign clusters
Compute centroids
Re-assign clusters
Compute centroids
Re-assign clusters
Converged!
From S. Mooney
K-Means Clustering:
Strengths & Weaknesses
• Fast, O(N)
• Hard to know which K to choose
• Try several and assess cluster
quality
• Hard to know where to seed the
clusters
• Results can change drastically with
different initial choices for
centroids - as shown in example:
Doina Caragea
Example Illustrating Sensitivity to Seeds
In the figure, starting with B and E as centroids converges to {A,B,C} and {D,E,F}.
Starting with D and F converges to {A,B,D,E} and {C,F}.
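The effect of the seeds can be reproduced with a small experiment. The coordinates below are hypothetical, since the slide's figure is not available; they were chosen as two rows of three points so that the two seedings behave as described.

```python
import math

def kmeans_from(points, seeds):
    """k-means started from named seed points; returns clusters as sets of labels."""
    centers = [points[s] for s in seeds]
    previous = None
    while True:
        clusters = [set() for _ in centers]
        for label, p in points.items():
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].add(label)
        if clusters == previous:          # assignments stopped changing
            return clusters
        previous = clusters
        centers = [
            tuple(sum(points[m][d] for m in c) / len(c) for d in range(2))
            for c in clusters
        ]

# Hypothetical coordinates: two rows of three points, middle column nearer the left.
points = {"A": (0.0, 1.0), "B": (0.8, 1.0), "C": (2.0, 1.0),
          "D": (0.0, 0.0), "E": (0.8, 0.0), "F": (2.0, 0.0)}

print(kmeans_from(points, ["B", "E"]))
print(kmeans_from(points, ["D", "F"]))
```

Both results are stable under further iterations, which is why running k-means from several different seeds and comparing cluster quality is usually worthwhile.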
Choice of K? Helpful to have additional
information to aid evaluation of clusters
Doina Caragea
Hierarchical Clustering vs K-Means
               Hierarchical Clustering                 K-Means
Running Time   Slower                                  Faster
Assumptions    Requires distance metric                Requires distance metric
Parameters     None                                    K (number of clusters)
Clusters       Subjective (only a tree is returned)    Exactly K clusters
Doina Caragea
Clustering vs Classification
• Clustering (unsupervised learning)
• Uses primary data to group measurements, with no
information from other sources
• Classification (supervised learning)
• Uses known groups of interest (from other sources)
to learn features associated with these groups in
primary data and create rules for associating data
with groups of interest
Doina Caragea
Compare in Graphical Representation
Clustering
Classification
Apply external labels:
RED group & BLUE group
Doina Caragea
Tradeoffs
• Clustering is not biased by previous knowledge, but
therefore needs stronger signal to discover
clusters
• Classification uses previous knowledge, so can
detect weaker signal, but may be biased by
WRONG previous knowledge
Doina Caragea
Methods for Classification
• K-nearest neighbors
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines
Doina Caragea
K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data
points (e.g., for k = 4)
Doina Caragea
Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point.
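The three steps translate almost directly into code. A minimal sketch, assuming Euclidean distance as the metric; the training data are made-up two-gene profiles, and ties (possible with even K) are broken arbitrarily here.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label); returns the majority label of the k nearest."""
    # 1. distance from the query to every labeled point
    distances = [(math.dist(x, query), label) for x, label in train]
    # 2. sort and keep the k smallest distances
    nearest = sorted(distances, key=lambda t: t[0])[:k]
    # 3. majority vote among the k neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression profiles with class labels.
train = [((1.0, 1.0), "normal"), ((1.2, 0.9), "normal"), ((0.9, 1.3), "normal"),
         ((4.0, 4.2), "cancer"), ((4.1, 3.8), "cancer"), ((3.9, 4.0), "cancer")]
print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (3.8, 4.1), k=3))
```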
Doina Caragea
SLIDES FOLLOWING THIS ONE
WERE NOT SHOWN IN LECTURE
• Some of this is material I discussed or wrote on
blackboard
• It is provided here for your information & for
future reference
• It will not be covered on the Final Exam!
Microarray Technology
Details re: 2 types of arrays
• cDNA “Slides”
• Short-oligonucleotide “Chips”
A few words about microarray terminology:
• Probes refers to cDNAs or DNA oligos attached to slide or chip
• Target refers to labeled mRNA or cRNA in solution, which is
hybridized to probes attached to slide or chip
Note: this is opposite of terminology used in discussing Southern blots, etc,
in which target is DNA attached to solid matrix & probe is labeled RNA or
cDNA in solution, which is hybridized to targets attached to matrix
cDNA Microarrays
• Glass slides or similar supports containing cDNA sequences that
serve as probes for measuring mRNA levels in target samples
• cDNAs are arrayed on each slide in a grid of spots.
• Each spot contains thousands of copies of a sequence that
matches a segment of a gene’s coding sequence.
• A sequence and its complement are present in the same spot.
• Different spots typically represent different genes, but some
genes may be represented by multiple spots
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarray Probes
• Expressed Sequence Tags (ESTs) commonly serve as probes on
cDNA microarrays.
• ESTs are small pieces of cDNA sequence (usually 200 to 500
nucleotides long) that have been reverse-transcribed from mRNA
[Figure: mRNA (AAAAAAAAA...A), its cDNA (TTTTTTTTTT...T), and two ESTs]
Dan Nettleton, ISU
Statistics 416/516X
[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, each with a spot for gene 201 (GATATG...) and a spot for gene 576 (TTCCAG...)]
Each spot contains many copies of a sequence along with its complement (not shown).
Dan Nettleton, ISU
Statistics 416/516X
Spotting Probes on the Microarray
8 X 4 Print Head
plate with wells holding probes in solution
All spots of the same color are made at the same time.
All spots in the same sector are made by the same pin.
microarray slide
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels
• RNA is extracted from a target sample of interest.
• mRNAs are reverse transcribed into cDNA.
• The resulting cDNAs are labeled with a fluorescent dye and are
incubated with the microarray slide.
• Dyed cDNA sequences hybridize to complementary probes
spotted on the array.
• A laser excites the dye and a scanner records an image of the
slide.
• The image is quantified to obtain measures of fluorescence
intensity for each pixel.
• Pixel values are processed to obtain measures of mRNA
abundance for each probe on the array.
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels
(cont.)
• Usually two samples, dyed with different dyes, are hybridized to
a single slide.
• The dyes fluoresce at different wavelengths so it is possible to
get separate images for each dye.
• Images from the scanner are black and white, but it is typical to
display Cy3 images as green and Cy5 images as red.
• It is common to superimpose the two images, using yellow to
indicate a mixture of green and red.
Dan Nettleton, ISU
Statistics 416/516X
Problems with cDNA Microarrays: Difficult to
Make Meaningful Comparisons between Genes
• Measures of mRNA levels are affected by several factors that
are partly or completely confounded with genes (e.g., EST
source plate, EST well, print pin, slide position, length of mRNA
sequence, base composition of mRNA sequence, specificity of
probe sequence, etc.).
• Within-gene comparisons of multiple cell types or across
multiple treatment conditions are much more meaningful.
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels:
Step 1: Prepare Microarray Slide & Sample mRNAs
[Figure: microarray slide with six spots (probes): ACCTG...G, TTCTG...A, GGCTT...C, ATCTA...A, ACGGG...T, CGATA...G; Sample 1 and Sample 2 contain unknown mRNA sequences (target)]
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels:
Step 2: Convert mRNA to cDNA & label with Fluorescent Dyes
[Figure: labeled cDNA from Sample 1 (TTCTG...A, ACCTG...G, GGCTT...C, ATCTA...A, plus unidentified sequences) and from Sample 2 (CGATA...G, ACGGG...T, plus unidentified sequences)]
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels:
Step 3: Mix Labeled cDNA and Hybridize to Slide
[Figure: labeled cDNAs from Sample 1 and Sample 2 are mixed and hybridized to the probes on the slide]
Dan Nettleton, ISU
Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels:
Step 5: Excite Dye with Laser, Scan & Quantify Signals
Sample 1
ACCTG...G
TTCTG...A
7652
138
5708
4388
GGCTT...C
ATCTA...A
8566
765
1208
13442
ACGGG...T
CGATA...G
6784
9762
67
239
Dan Nettleton, ISU
Statistics 416/516X
Sample 2
Pros/Cons of Spotted cDNA Arrays
• Many sources of variation in the manufacture of
these arrays, print tips, lab, etc.
• Contamination
• Uneven distribution
• Flexible, can put any cDNA on slide
Dan Nettleton, ISU
Statistics 416/516X
DNA Oligonucleotide “Chips”
• An oligonucleotide microarray is a microarray whose probes
consist of synthetically created DNA oligonucleotides.
• Probe sequences are chosen to have good and relatively uniform
hybridization characteristics
• A probe is chosen to match a portion of its target mRNA
transcript that is unique to that sequence.
• Oligo probes can distinguish among multiple mRNA transcripts
with similar sequences.
Dan Nettleton, ISU
Statistics 416/516X
[Figures credited to www.affymetrix.com]
Simplified Example
[Figure: transcripts of gene 1 and gene 2. Shared green regions indicate high degree of sequence similarity throughout much of the transcript]
oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG
Dan Nettleton, ISU
Statistics 416/516X
Oligo Microarray Fabrication
• Oligos can be synthesized and stored in solution for spotting as
is done with cDNA microarrays.
• Oligo sequences can be synthesized on a slide or chip using
various commercial technologies.
• In one approach, sequences are synthesized on a slide using inkjet technology similar to that used in color printers. Separate
cartridges for the four bases (A, C, G, T) are used to build
nucleotides on a slide.
• Affymetrix uses a photolithographic approach.
Dan Nettleton, ISU
Statistics 416/516X
Affymetrix GeneChips
• Affymetrix (www.affymetrix.com) manufactures GeneChips,
oligonucleotide arrays.
• Each gene (or sequence of interest or feature) is represented by
multiple short (25-nucleotide) oligo probes.
• Some GeneChips include probes for around 60,000 genes.
• mRNA that has been extracted from a biological sample can be
labeled (dyed) and hybridized to a GeneChip in a manner similar to
that described for cDNA microarrays.
• Only one sample is hybridized to each GeneChip rather than two as
in the case of cDNA microarrays.
Dan Nettleton, ISU
Statistics 416/516X
Affymetrix Probe Sets
• A probe set is used to measure mRNA levels of a single gene.
• Each probe set consists of multiple probe cells.
• Each probe cell contains millions of copies of one oligo.
• Each oligo is intended to be 25 nucleotides in length.
• Probe cells in a probe set are arranged in probe pairs.
• Each probe pair contains a perfect match (PM) probe cell and
a mismatch (MM) probe cell.
• A PM oligo perfectly matches part of a gene sequence.
• A MM oligo is identical to a PM oligo except that the middle
nucleotide (13th of 25) is intentionally replaced by its
complementary nucleotide.
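The PM/MM relationship is simple enough to express directly. A minimal sketch; the function name is illustrative and the 25-mer below is used only as sample input.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM oligo: the PM oligo with its middle base complemented."""
    if len(pm) != 25:
        raise ValueError("expected a 25-nucleotide PM oligo")
    mid = 12                       # 0-based index of the 13th base
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "ATTACTAAGCATAGATTGCCGTATA"    # 25-mer, sample input only
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The two oligos differ at exactly one position, so a signal seen on both PM and MM cells is more likely to reflect non-specific hybridization.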
Dan Nettleton, ISU
Statistics 416/516X
Affymetrix GeneChips
[Figure: mRNA reference sequence (5’ to 3’) covered by spaced DNA probe pairs]
Reference sequence: …TGTGATGGTGGGAATGGGTCAGAAGGGACTCCTATGTGGGTGACGAGGCC…
Perfect match oligo: TTACCCAGTCTTCCCTGAGGATACAC
Mismatch oligo:      TTACCCAGTCTTGCCTGAGGATACAC
PM - 25 bases complementary to gene
MM - Middle base is different
Probe Set = multiple Probe Pairs; Probe Pair = one PM Probe Cell + one MM Probe Cell
Dan Nettleton, ISU
Statistics 416/516X
Different Probe Pairs Represent Different
Parts of the Same Gene
gene sequence
Probes are selected to be specific to the target gene
and have good hybridization characteristics.
Dan Nettleton, ISU
Statistics 416/516X
Obtaining Labeled Target for Affy Chips
1. RNA → single-stranded cDNA
2. Single-stranded cDNA → double-stranded cDNA
3. Double-stranded cDNA → labeled single-stranded cRNA
complementary to coding sequence
Number of copies of each sequence gets amplified in conversion
to cRNA.
Dan Nettleton, ISU
Statistics 416/516X
Pros/Cons GeneChip Arrays
• Consistent manufacture -> good standardization
• Comparable across experiments
• Design is time-consuming, good for large sets of
chips
• Can only see what is on the chip
Dan Nettleton, ISU
Statistics 416/516X
Affymetrix Data Processing Pipeline
Experiment preparation: *.exp file
Image of the scanned probe array: *.dat file
Probe Cell Intensity file: *.cel file
Analysis output: *.chp file
Software: MicroArray Suite or other analysis software
Dan Nettleton, ISU
Statistics 416/516X
Reminder:
Why do microarray experiments?
• Compare two (or more) conditions to identify
differentially expressed genes
• Control/treatment
• Disease/normal
• Exploratory analysis
• What genes are expressed in response to drought stress?
• What gene expression changes occur during normal retinal
development?
• Diagnostic & prognostic tool development:
• Can we predict certain conditions (breast cancer vs
normal)?
• Can we identify patterns of gene expression that predict a
patient’s response to treatment/drug?
Doina Caragea
Differential Gene Expression
• Are there significant differences in expression
level between the conditions?
• Analysis of Variance (ANOVA)
[Figure: expression in Mutant 1 and Mutant 2, each compared between Inoculated and Control]
Doina Caragea
Exploratory Analysis
• Find patterns in data to see what genes are
expressed under different conditions
• Analysis includes clustering methods
• Used when little or no prior knowledge exists about
the problem
Doina Caragea
Classification
• Learn characteristic patterns from a training set
and evaluate with a test set.
• Classify tumor types based on expression
patterns
• Predict disease susceptibility, stages, etc.
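The train-then-test workflow can be illustrated with a very simple classifier. The sketch below uses a nearest-centroid rule, chosen here only for brevity (the slide does not name a method), and the profiles and class names are made up.

```python
import math

def fit_centroids(samples):
    """samples: list of (features, label). Returns one mean profile per class."""
    groups = {}
    for x, label in samples:
        groups.setdefault(label, []).append(x)
    return {label: tuple(sum(c) / len(xs) for c in zip(*xs))
            for label, xs in groups.items()}

def predict(centroids, x):
    """Assign x to the class whose mean profile is closest."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

# Hypothetical expression profiles (two genes per sample).
train = [((1.0, 5.0), "type A"), ((1.2, 4.8), "type A"),
         ((4.0, 1.0), "type B"), ((4.2, 1.1), "type B")]
test = [((0.9, 5.2), "type A"), ((3.9, 0.8), "type B")]

model = fit_centroids(train)               # learn from the training set only
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(accuracy)                            # evaluated on held-out samples
```

Keeping the test samples out of `fit_centroids` is the important part: accuracy measured on the training data itself would be overly optimistic.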
Doina Caragea
Microarray data analysis
Preprocessing
normalization
scatter plots
Inferential statistics
t-test
ANOVA
Exploratory (descriptive) statistics
distances
clustering
principal components analysis (PCA)
Doina Caragea
Pre-processing
Main goal of data preprocessing is to remove any systematic
bias in the data as completely as possible, while preserving
variation in gene expression that occurs because of
biologically relevant changes in transcription.
Observed differences in gene expression could be due to
transcriptional changes, or they could be caused by
artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Doina Caragea
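As one concrete illustration of removing a systematic bias, the sketch below applies global median centering to two-color log ratios. This is just one simple normalization, shown under the assumption that most genes are unchanged between samples; it is not necessarily the method used in the course, and the intensities are made up.

```python
import math
from statistics import median

def log_ratios(cy5, cy3):
    """log2(Cy5 / Cy3) for each spot."""
    return [math.log2(r / g) for r, g in zip(cy5, cy3)]

def median_center(values):
    """Subtract the array-wide median so the typical log ratio is 0."""
    m = median(values)
    return [v - m for v in values]

# Hypothetical intensities in which the Cy5 channel reads 2x too bright.
cy3 = [100.0, 200.0, 400.0, 800.0, 1600.0]
cy5 = [200.0, 400.0, 800.0, 6400.0, 3200.0]

normalized = median_center(log_ratios(cy5, cy3))
print(normalized)
```

After centering, the uniform two-fold dye bias disappears and only the fourth spot still shows a difference between channels.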
Page 191
Inferential statistics
Inferential statistics are used to make inferences
about a population from a sample.
Hypothesis testing is a common form of inferential
statistics. A null hypothesis is stated, such as:
“There is no difference in signal intensity for the gene
expression measurements in normal and diseased
samples.” The alternative hypothesis is that there
is a difference.
We use a test statistic to decide whether to reject or fail to
reject the null hypothesis. For many applications,
we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
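A hypothesis test of this kind can be sketched without any statistics library by using an exact permutation test on the difference of group means. This is a stand-in chosen because it needs only the standard library (a t-test or ANOVA would be the more usual choice), and the intensities are made up.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        # small tolerance guards against floating-point round-off
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical signal intensities for one gene.
normal = [5.1, 4.9, 5.3, 5.0]
diseased = [6.2, 6.0, 6.4, 5.9]

p = permutation_p_value(normal, diseased)
print(p, "reject H0" if p < 0.05 else "fail to reject H0")
```

With thousands of genes tested at once, many will pass p < 0.05 by chance alone, so some correction for multiple testing is normally applied as well.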
Doina Caragea
Page 199
Descriptive statistics
Microarray data are high-dimensional: there are
many thousands of measurements made from a small
number of samples.
Descriptive (exploratory) statistics help you to find
meaningful patterns in the data.
A first step is to arrange the data in a matrix.
Next, use a distance metric to define the relatedness
of the different data points. Two commonly used
distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
Doina Caragea
Page 203
Limitations of Microarrays
• Link between proteins and expressed RNA not always
clear
• Difficult to compare between microarray platforms
• Only see what is on the microarray
• Gene finding is still an art
• Other coding regions, “dark matter” on genome
• But now microarrays for these are being developed, too!
Doina Caragea