Introduction to Microarray

Microarray
Yuki Juan
NTUST
May 26, 2003
Content






Biology background of microarray
Design of microarray
The workflow of microarray
Image analysis of microarray
Data analysis of microarray
Discussion
The Biology Background of Microarray




The central dogma of life forms
DNA
RNA
Monitoring the expression of genes
Central Dogma

DNA Replication
--ACGCGA---TGCGCT--

RNA Transcription
--UGCGCU--

Protein Translation
--CYSALA--
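The three example sequences above can be traced in a short Python sketch. It assumes the first DNA fragment (ACGCGA) is the template strand, builds the RNA by base pairing, and translates it with a deliberately tiny codon table that covers only the two codons in this example. The function names are illustrative, not from the slides.

```python
# Base-pairing rules used when a DNA template strand is transcribed into RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Deliberately tiny codon table: only the two codons that occur in the slide.
CODON_TABLE = {"UGC": "CYS", "GCU": "ALA"}


def transcribe(template_dna: str) -> str:
    """Return the RNA that pairs base by base with the DNA template."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)


def translate(rna: str) -> str:
    """Translate an RNA string codon by codon using CODON_TABLE."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


rna = transcribe("ACGCGA")
protein = translate(rna)
print(rna, protein)  # UGCGCU CYSALA
```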
[Figure: central dogma diagram. DNA (replication) -> transcription -> RNA -> translation -> Protein]
DNA

The double helix


Nucleotide


A, T, G, C
Base pair



stable
A–T
G–C
Oligonucleotide

short DNA (tens of
nucleotides, or bps)
(http://www.nhgri.nih.gov/)
DNA Strand

DNA has canonical orientation


read from 5’ to 3’
antiparallel: one strand has direction
opposite to its complement’s
5’ … TACTGAA … 3’
3’ … ATGACTT … 5’
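As a small illustration of the antiparallel pairing above, the following Python sketch computes the partner of a strand. The helper names are hypothetical; only the A-T and G-C pairing rules come from the slides.

```python
# Watson-Crick pairing rules for DNA.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Pair each base with its partner, keeping the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Partner strand written in its own 5' to 3' reading direction."""
    return complement(strand)[::-1]


top = "TACTGAA"  # read 5' to 3'
print(complement(top))          # ATGACTT, the partner strand read 3' to 5'
print(reverse_complement(top))  # TTCAGTA, the same partner read 5' to 3'
```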
Hydrogen Bonds Make DNA Binding Specific
[Figure: hydrogen bonds between two antiparallel strands, one running 5’ to 3’ and the other 3’ to 5’]

The force between the two bases of a pair is the hydrogen bond.
This force lets A pair specifically with T (or U in RNA) and C pair
specifically with G.
RNA

Types



messenger RNA (mRNA)
ribosomal RNA (rRNA)
transfer RNA (tRNA)
A gene is expressed by transcribing DNA
into single-stranded mRNA
RNA (Detailed)
(http://www.nhgri.nih.gov/)
Reverse Transcription
Using reverse transcriptase, we can convert RNA into complementary DNA (cDNA).
The Southern Blot
A basic DNA detection technique, known as the Southern blot, that has
been used for over 30 years:
• A “known” strand of DNA is deposited on a solid support (e.g. nitrocellulose paper)
• An “unknown” mixed bag of DNA is labeled (radioactive or fluorescent)
• The “unknown” DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then excess solution is washed off
• If a copy of the “known” DNA occurs in the “unknown” sample, it will stick (hybridize), and the labeled DNA will be detected on photographic film
mRNA Represents Gene Function
• When we measure the level of an mRNA, we are monitoring the activity of a gene.
• Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome.
• A microarray collects the equivalent of over 10,000 blots in a single experiment, which makes monitoring genome activity possible.
Design of Microarray



Microarrays in different contexts
The idea of microarray
Main types of array chips
mRNA Levels Compared in Many
Different Contexts





Different tissues, same organism (brain v.
liver)
Same tissue, same organism (tumor v. nontumor)
Same tissue, different organisms (wt v.
mutant)
Time course experiments (development)
Other special designs (e.g. to detect spatial
patterns).
Idea of Microarray
Cell A
Cell B
Labeled cDNA
from geneX
Hybridization to chip
Spot of geneX with
complementary sequence
of colored cDNA
This spot shows red color after scanning.
Over 10,000 Hybridizations Can Be
Done at One Time
Several Types of Arrays

• Spotted DNA arrays
    • Developed by Pat Brown’s lab at Stanford
    • PCR products of full-length genes (>100nt)
• Affymetrix gene chips
    • Photolithography technology from the computer industry allows building many 25-mers
• Ink-jet microarrays from Agilent
    • 25-60-mers “printed” directly on glass slides
    • Flexible, rapid, but expensive
Array Fabrication Spotting
• Use PCR to amplify DNA
• Robotic "pen" deposits DNA at defined coordinates
• approximately 1-10 ng per spot
• Experimentation with oligos (40, 70 bp)
This machine can make 48 microarrays
simultaneously.
Array Fabrication Photolithography
• Light-activated synthesis
• synthesize oligonucleotides on glass slides
• 10^7 copies per oligo in a 24 x 24 um square
• Use 20 pairs of different 25-mers per gene
• Perfect match and mismatch
Array Fabrication Photolithography
Affymetrix Microarrays
[Figure: raw image of the chip; labeled dimensions 1.28 cm and 50 um]
~10^7 oligonucleotides,
half perfectly match mRNA (PM),
half have one mismatch (MM)
Raw gene expression is intensity
difference: PM - MM
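A minimal sketch of the PM - MM idea, assuming the per-gene value is the plain average of the differences over that gene's probe pairs. The averaging step and the intensities below are illustrative assumptions, not values from the slides.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene."""
    if len(pm) != len(mm):
        raise ValueError("each PM probe needs a matching MM probe")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for four probe pairs of one gene.
pm = [1200.0, 950.0, 1430.0, 1010.0]
mm = [300.0, 410.0, 380.0, 290.0]
print(average_difference(pm, mm))  # 802.5
```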
Agilent cDNA Microarray and
Oligonucleotide Microarray
• Agilent delivers printed 60-mer microarrays in addition to 25-mer formats.
• The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.
The Workflow of Microarray
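[Figure: workflow diagram. Array side: plate -> plate preparation -> array fabrication -> array. Sample side: sample -> RNA extraction -> cDNA synthesis and labeling -> labeled cDNA. Array and labeled cDNA -> hybridization -> hybridized array -> scanning]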
cDNA Synthesis and Direct Labeling
Cy3 and Cy5 cDNA Hybridization onto
the Chip
e.g. treatment / control
normal / tumor tissue
Sample loading
1. Loading from the corner of the cover slip.
   It is time-consuming and easily produces bubbles.
2. Loading the sample at the center of the array, then putting the slip on smoothly.
   Faster, with a lower chance of producing bubbles than the first method.
3. Loading the sample at the side of the array, then putting the slip on.
   The solution attaches to the slip as soon as the slip touches it, and spreads with the movement of the slip as we slowly lower it.
Scan
Green: down-regulated
Red: up-regulated
Yellow: equal level
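The color reading above can be sketched as a rule on the red/green intensity ratio. The twofold cutoff and the function name are assumptions for illustration; the slides do not state a threshold.

```python
import math


def spot_color(red: float, green: float, fold: float = 2.0) -> str:
    """Classify a spot from its red and green intensities.

    'fold' is a hypothetical cutoff: ratios within 1/fold..fold count as equal.
    """
    log_ratio = math.log2(red / green)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "red"      # up-regulated
    if log_ratio <= -threshold:
        return "green"    # down-regulated
    return "yellow"       # roughly equal level


print(spot_color(4000.0, 1000.0))  # red
print(spot_color(1000.0, 4000.0))  # green
print(spot_color(1000.0, 1100.0))  # yellow
```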
Image analysis



Find the spots
Convert features into numeric data
Image normalization
The Algorithms
1. Find spots: Finds the location of each spot on
the microarray.
2. Cookie cutter algorithm:
(1) Assume the pixel intensities of a spot follow a Gaussian distribution
(2) Use the SD or IQR to identify the feature and
background pixels of each spot
(3) Calculate statistics for the pixel population
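Step (2) with the IQR option can be sketched as follows: pixels outside a band around the quartiles are rejected before the statistics in step (3) are computed. The multiplier k = 1.42 is read off the slide's figure label and is an assumption here, as are the function name and the pixel values.

```python
from statistics import mean, quantiles


def iqr_filtered_mean(pixels, k=1.42):
    """Reject pixels outside [Q1 - k*IQR, Q3 + k*IQR], then average the rest.

    Returns (mean of kept pixels, number of rejected pixels).
    """
    q1, _, q3 = quantiles(pixels, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [p for p in pixels if low <= p <= high]
    return mean(kept), len(pixels) - len(kept)


# Hypothetical feature pixels with one bright artifact (5000).
pixels = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
signal, rejected = iqr_filtered_mean(pixels)
print(signal, rejected)
```

The same function could be applied separately to the feature pixels and to the local background pixels of each spot.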
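Interquartile Range (IQR)
[Figure: pixel intensity distribution marked at 25%, 50% and 75%, with the IQR between the 25% and 75% points and a boundary for rejection on each side; labels: K=IQR/2, 1.42 IQR]
[Figure: spot layout showing the feature (or cookie) of diameter D, the exclusion zone, and the local background]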
Data Quality



• Irregular size or shape
• Irregular placement
• Low intensity
• Saturation
• Spot variance
• Background variance
[Figure: example spots labeled indistinguishable, saturated, bad print, misalignment, artifact]
Convert Feature Into Numeric Value
Columns: Ctrl = green channel, Data = red channel. DxA-PSL = intensity, Bkgd = background (b.g.), sDxA = b.g.-corrected intensity, Data/Ctrl = ratio of b.g.-corrected values (R. b.g.-c)/(G. b.g.-c), Name = systematic name.

Spot    Ctrl DxA-PSL  Ctrl Bkgd  Ctrl sDxA  Data DxA-PSL  Data Bkgd  Data sDxA  Data/Ctrl  Name     Gene function
A_1_1   59358.75      512.92     58845.83   50953.13      1779.913   49173.22   0.835628   YAL003W  translation elongation factor eef1beta
A_1_2   1209.19       512.92     696.271    2522.345      1779.913   742.4323   1.066298   YAR053W  hypothetical protein
A_1_3   1948.2        512.92     1435.28    3100.152      1779.913   1320.239   0.919848   YBL078C  essential for autophagy
A_1_4   4940.806      512.92     4427.886   6670.604      1779.913   4890.691   1.104521   YAL008W  protein of unknown function
A_1_5   1485.59       512.92     972.671    2916.086      1779.913   1136.173   1.168096   YAR062W  putative pseudogene
A_1_6   32642.03      512.92     32129.11   42304.13      1779.913   40524.22   1.261293   YBL087C  60s large subunit ribosomal protein l23.e
A_1_7   6919.441      512.92     6406.521   8540.246      1779.913   6760.333   1.055227   YAL014C
A_1_8   2698.301      512.92     2185.382   4314.47       1779.913   2534.557   1.159778   YAR068W  strong similarity to hypothetical protein yhr214w
A_1_9   7167.958      512.92     6655.038   7379.286      1779.913   5599.373   0.841374   YBL100C  questionable orf
A_1_10  5470.062      512.92     4957.142   6953.799      1779.913   5173.886   1.043724   YAL025C  nuclear viral propagation protein
A_1_11  27879.49      512.92     27366.57   33746.9       1779.913   31966.99   1.168103   YBL002W  histone h2b.2
A_1_12  2589.613      512.92     2076.693   4385.568      1779.913   2605.655   1.254713   YBL107C  hypothetical protein
A_1_13  6196.245      512.92     5683.326   8840.475      1779.913   7060.562   1.242329   YDR044W  coproporphyrinogen iii oxidase
A_1_14  34737.1       512.92     34224.18   36129.62      1779.913   34349.7    1.003668   YDR134C  strong similarity to flo1p, flo5p, flo9p and ylr110
A_1_15  34035.35      512.92     33522.43   27128.53      1779.913   25348.62   0.756169   YDR233C  similarity to hypothetical protein ydl204w
A_1_16  1638.381      512.92     1125.461   2988.042      1779.913   1208.129   1.073453   YDR048C  questionable orf
A_1_17  3873.718      512.92     3360.799   4955.141      1779.913   3175.228   0.944784   YDR139C  ubiquitin-like protein
A_1_18  2433.625      512.92     1920.706   3502.406      1779.913   1722.493   0.896802   YDR252W  strong similarity to egd1p and to human btf3 pro
A_1_19  1800.736      512.92     1287.816   3011.855      1779.913   1231.942   0.956613   YDR053W  questionable orf
A_1_20  1296.689      512.92     783.77     2636.549      1779.913   856.6356   1.092968   YDR149C  questionable orf
A_1_21  3453.24       512.92     2940.32    4968.026      1779.913   3188.113   1.084274   YDR260C  hypothetical protein
A_1_22  10731.55      512.92     10218.63   9307.246      1779.913   7527.333   0.736629   YDR056C  hypothetical protein
A_1_23  6191.309      512.92     5678.39    8808.398      1779.913   7028.485   1.23776    YDR152W  weak similarity to c.elegans hypothetical protein
A_1_24  3589.998      512.92     3077.078   4420.744      1779.913   2640.831   0.858227   YDR269C  questionable orf
A_1_25  27568.34      512.92     27055.42   20856.2       1779.913   19076.29   0.705082   YGL189C  40s small subunit ribosomal protein s26e.c7
A_1_26  1956.182      512.92     1443.262   3150.716      1779.913   1370.803   0.949795   YGL261C  strong similarity to members of the srp1/tip1 fa
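The ratio column in the data above can be reproduced with a short sketch: subtract each channel's background, then divide Data by Ctrl. The function name is illustrative; the numbers are the YAL003W row.

```python
def background_corrected_ratio(data_raw, data_bkgd, ctrl_raw, ctrl_bkgd):
    """Ratio of background-corrected Data (red) to Ctrl (green) intensity."""
    data_signal = data_raw - data_bkgd
    ctrl_signal = ctrl_raw - ctrl_bkgd
    if ctrl_signal <= 0:
        raise ValueError("control signal is not above background")
    return data_signal / ctrl_signal


# YAL003W: Data 50953.13 (bkgd 1779.913), Ctrl 59358.75 (bkgd 512.92).
ratio = background_corrected_ratio(50953.13, 1779.913, 59358.75, 512.92)
print(round(ratio, 4))  # 0.8356
```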
Data Normalization

Normalize data to correct for
variances






Dye bias
Location bias
Intensity bias
Pin bias
Slide bias
Control vs. non-control spots
Data Normalization
Uncalibrated: red light
under-detected
Calibrated: red and green
equally detected
Data Normalization

Assumptions
• Overall mean ratio should be 1
• Most genes are not differentially expressed
• Total intensities of the two dyes are equivalent
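Under these assumptions, a global normalization can be sketched by shifting all log ratios so that their center sits at 0 (ratio 1). Using the median rather than the mean as the center is a choice made for this sketch, and the intensities are made up.

```python
import math
from statistics import median


def global_normalize(red, green):
    """Return log2(R/G) values shifted so that their median is 0."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]


# Hypothetical slide where red is under-detected by a factor of 2;
# only the last gene is truly different between the two samples.
red = [200.0, 400.0, 800.0, 100.0, 1600.0]
green = [400.0, 800.0, 1600.0, 200.0, 800.0]
print(global_normalize(red, green))  # [0.0, 0.0, 0.0, 0.0, 2.0]
```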
Intensity Dependent Normalization
After Normalization
Additional Normalization

Pin dependent



Similar to intensity dependent fit.
Compute individual lowess fits for each
pin group
Within slide normalization


After pin dependent normalization, log
ratios for each pin are centered around
0
Scale variance for each pin

Uses MAD (median absolute deviation)
Additional Normalization

Dye swap


Combine relative expression levels
without explicit normalization
Compute lowess fit for
(1/2) log2(RR’/GG’) vs. (A + A’)/2

Normalized ratio is
log2(R/G) - c(A)
where c(A) is the lowess prediction
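The correction log2(R/G) - c(A) can be sketched without a lowess library. In the sketch below, c(A) is estimated by a crude stand-in, the mean M of the k spots nearest in A, which is not lowess itself. It assumes M = log2(R/G) and A = (log2 R + log2 G)/2; all names are illustrative.

```python
import math


def ma_values(red, green):
    """M = log2(R/G) and A = average log2 intensity for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def local_trend(m, a, k=5):
    """Stand-in for the lowess prediction c(A): mean M of the k nearest spots in A."""
    trend = []
    for a_i in a:
        nearest = sorted(range(len(a)), key=lambda j: abs(a[j] - a_i))[:k]
        trend.append(sum(m[j] for j in nearest) / len(nearest))
    return trend


def intensity_normalize(red, green, k=5):
    """Normalized log ratio: log2(R/G) - c(A)."""
    m, a = ma_values(red, green)
    c = local_trend(m, a, k)
    return [m_i - c_i for m_i, c_i in zip(m, c)]


# Hypothetical slide with a constant twofold dye bias and no real changes.
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
red = [2.0 * g for g in green]
print(intensity_normalize(red, green))  # all zeros
```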
Data analysis





Data filtering
Fold change analysis
Classification
Clustering
Future direction
Microarray Data Classification
[Figure: pipeline from microarray chips, to images scanned by laser, to datasets, to data mining and analysis, to a prediction for a new sample]

Gene            Value
D26528_at       193
D26561_cds1_at  -70
D26561_cds2_at  144
D26561_cds3_at  33
D26579_at       318
D26598_at       1764
D26599_at       1537
D26600_at       1204
D28114_at       707

Datasets
Class  Sno  D26528  D63874  D63880  …
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
…
The Threshold of Spots

Filtering - remove genes with insufficient
variation





Remove inadequate spots:
saturated, non-uniform, background too
high…
Remove extreme signal:
e.g. MaxVal - MinVal < 500 and
MaxVal/MinVal < 5
Statistical filtering (e.g. p-value<0.01)
biological reasons
feature reduction for algorithmic
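The example threshold above (MaxVal - MinVal < 500 and MaxVal/MinVal < 5) can be sketched as a per-gene filter. It is read here as a test for insufficient variation, which is an interpretation. The floor of 20 used to avoid dividing by zero or negative values is also an assumption, not from the slides.

```python
def passes_variation_filter(values, min_diff=500.0, min_fold=5.0, floor=20.0):
    """Keep a gene unless both its range and its max/min fold are small.

    'floor' is a hypothetical lower clip so the fold ratio stays defined.
    """
    clipped = [max(v, floor) for v in values]
    hi, lo = max(clipped), min(clipped)
    low_variation = (hi - lo) < min_diff and (hi / lo) < min_fold
    return not low_variation


# Columns of the ALL/AML dataset shown earlier (samples 2-5 and 51-53).
d26528 = [193, 129, 44, 218, 109, 106, 211]
d63874 = [4157, 11557, 12125, 8484, 3537, 4578, 2431]
print(passes_variation_filter(d26528))  # False: range 174, fold below 5
print(passes_variation_filter(d63874))  # True: range well above 500
```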
Microarray Data Analysis Types
• Differential gene expression
    • Fold change analysis
• Classification (Supervised)
    • identify disease
    • predict outcome / select best treatment
• Clustering (Unsupervised)
    • find new biological classes / refine existing ones
    • exploration
• …
Differential Gene Expression

• n-fold change
    • n typically >= 2
    • May hold no biological relevance
    • Often too restrictive
• 2σ expression
    • Calculate standard deviation σ
    • Genes with expression more than 2σ away from the mean are differentially expressed
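Both rules can be sketched on a list of expression ratios. Working on log2 ratios for the 2σ rule and measuring the distance from the mean log ratio are assumptions of this sketch; the ratios are made up.

```python
import math
from statistics import mean, stdev


def fold_change_calls(ratios, n=2.0):
    """True where the ratio is at least n-fold up or n-fold down."""
    return [r >= n or r <= 1.0 / n for r in ratios]


def two_sigma_calls(ratios):
    """True where the log2 ratio lies more than 2 sigma from the mean."""
    logs = [math.log2(r) for r in ratios]
    mu, sigma = mean(logs), stdev(logs)
    return [abs(x - mu) > 2 * sigma for x in logs]


# Hypothetical ratios: nine unchanged genes and one strongly induced gene.
ratios = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 1.0, 8.0]
print(fold_change_calls(ratios))
print(two_sigma_calls(ratios))
```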
Fold Changes-Scatter Plot
[Figure: scatter plot of 72 (raw) against 72 (control) on logarithmic axes]
Fold Changes Table
Fold change at 6 h, 24 h, 48 h and 72 h; "-" marks no value.

Group    GenBank accession No.  6 h    24 h   48 h   72 h   Description
Group 1  U60519                 -      -      -      0.471  caspase 10, apoptosis-related cysteine protease
Group 1  U97075                 -      -      -      0.355  CASP8 and FADD-like apoptosis regulator
Group 1  AF051941               -      -      -      0.376  nucleoside diphosphate kinase type 6 (inhibitor of p53-induced apoptosis-alpha)
Group 2  U13738                 -      2.301  -      -      caspase 3, apoptosis-related cysteine protease
Group 2  AF005775               -      2.272  -      -      CASP8 and FADD-like apoptosis regulator
Group 3  U60521                 -      -      2.519  -      caspase 9, apoptosis-related cysteine protease
Group 4  Z48810                 2.615  -      2.796  2.819  caspase 4, apoptosis-related cysteine protease
Group 5  AAF19819               -      -      -      5.249  inhibitor of apoptosis protein
Group 5  U67319                 -      -      -      2.19   caspase 7, apoptosis-related cysteine protease
Group 5  U28976                 -      -      -      2.603  caspase 4, apoptosis-related cysteine protease
Group 6  AF015450               -      -      -      6.912  CASP8 and FADD-like apoptosis regulator
Classification: Multi-Class
Similar Approach:
• select top genes most correlated to each class
• select best subset using cross-validation
• build a single model separating all classes
Advanced:
• build separate model for each class vs. rest
• choose model making the strongest prediction
Popular Classification Methods
• Decision Trees/Rules - find smallest gene sets, but also false positives
• Neural Nets - work well if number of genes is reduced
• SVM - good accuracy, does its own gene selection, hard to understand
• K-nearest neighbor - robust for small number of genes
• Bayesian nets - simple, robust
Multi-class Data Example
• Brain data, Pomeroy et al. 2002, Nature (415), Jan 2002
• 42 examples, about 7,000 genes, 5 classes
• Selected top 100 genes most correlated to each class
• Selected best subset by testing 1, 2, …, 20 gene subsets, leave-one-out cross-validation for each
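Leave-one-out cross-validation can be illustrated with the k-nearest-neighbor method listed among the popular classifiers. The sketch below is not the procedure used in the cited study. The two-gene expression profiles and the function names are invented for illustration, and the ALL/AML labels are borrowed from the earlier dataset slide only as class names.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, sample, k=3):
    """Majority class among the k training samples closest to 'sample'."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], sample))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(x, y, k=3):
    """Hold out each sample in turn, train on the rest, and score the guess."""
    hits = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_x, rest_y, x[i], k) == y[i]
    return hits / len(x)


# Hypothetical expression of two selected genes in eight samples.
x = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [1.1, 5.2],
     [5.0, 1.0], [4.8, 1.2], [5.1, 0.9], [5.2, 1.1]]
y = ["ALL"] * 4 + ["AML"] * 4
print(leave_one_out_accuracy(x, y))  # 1.0 on this well-separated toy data
```

Repeating the accuracy estimate for gene subsets of different sizes and keeping the best one mirrors the subset selection described above.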
Classification – Other Applications
• Combining clinical and genetic data
• Outcome / Treatment prediction
    • Age, sex, and stage of disease are useful
    • e.g. if data are from a male, not ovarian cancer
Clustering
Goals
Find natural classes in the data
Identify new classes / gene
correlations
Refine existing taxonomies
Support biological analysis /
discovery
Different Methods
• Hierarchical clustering, SOMs, etc.
SOM clustering
• SOM - self-organizing maps
• Preprocessing
    • filter away genes with insufficient biological variation
    • normalize gene expression (across samples) to mean 0, st. dev 1, for each gene separately
• Run SOM for many iterations
• Plot the results
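The per-gene normalization in the preprocessing step can be sketched as a z-score across samples. Using the population standard deviation is a choice made for this sketch, and the values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_gene(values):
    """Rescale one gene's expression across samples to mean 0, st. dev 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("gene has no variation across samples")
    return [(v - mu) / sd for v in values]


# Hypothetical expression of one gene in three samples.
z = standardize_gene([2.0, 4.0, 6.0])
print(z)
```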
SOM & K-Means by GeneSpring
Hierarchical Clustering

The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method.
• It works with the data in a bottom-up manner.
• Initially, each data point forms its own cluster; the algorithm then repeatedly merges the two clusters that are the most similar or have the shortest distance.
• The algorithm involves computing the distance or similarity matrix, which has O(N^2) complexity and thus is not very efficient.
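The bottom-up merging can be sketched in a few lines. The sketch assumes Euclidean distance and single linkage (the distance between two clusters is that of their closest members); the slides do not specify either. It recomputes distances on every pass instead of caching a matrix, for brevity.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until n_clusters remain.

    Clusters are returned as lists of indices into 'points'.
    """
    clusters = [[i] for i in range(len(points))]

    def distance(c1, c2):
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical two-dimensional expression profiles.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(single_linkage(points, 3))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree usually drawn next to an expression heat map.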
Hierarchical clustering
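Genomic Reprogramming in Response to Oxidant
[Figure: clustered expression of 6218 genes over a time course of 0, 10, 20, 40, 60 and 120 minutes; color scale from fold repression (>9, >6, >3) through 1:1 to fold induction (>3, >6, >9)]
One-third of genome expression is transiently reprogrammed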
Future directions
• Algorithms optimized for small samples (the no. of samples will remain small for many tasks)
• Integration with other data
    • biological networks
    • medical text
    • protein data
• cost-sensitive classification algorithms
    • error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
Integrate biological knowledge when
analyzing microarray data (from Cheng
Li, Harvard SPH)
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25
Microarray Potential Applications

Biological discovery





new and better molecular diagnostics
new molecular targets for therapy
finding and refining biological pathways
Mutation and polymorphism detection
Recent examples



molecular diagnosis of leukemia, breast
cancer, ...
appropriate treatment for genetic signature
potential new drug targets
Microarray Limitations




Cross-hybridization of sequences with high identity
Chip to chip variation
True measure of abundance?
Do mRNA levels reflect protein levels?





Generally, do not “prove” new biology - simply suggest genes
involved in a process, a hypothesis that will require traditional
experimental verification.
What fold change has biological relevance?
Need cloned EST or some sequence knowledge -- rare
messages may be undetected
Expensive!! Not every lab can afford to repeat
experiments.
The real limitation is Bioinformatics
Additional Information

Review papers on microarray




Genomics, gene expression and DNA
arrays (Nature, June 2000)
Microarray - technology review (Nature
Cell Biology, Aug. 2001)
Magic of Microarray (Scientific
American, Feb. 2002)
Molecular biology tutorial

http://www.lsic.ucla.edu/ls3/tutorials/
Biological data retrieval systems: Entrez
http://www.ncbi.nlm.nih.gov/Database/index.html
1. A retrieval system for searching a number of inter-connected databases at the NCBI. It provides access to:
    • PubMed: The biomedical literature (Medline)
    • Genbank: Nucleotide sequence database
    • Protein sequence database
    • Structure: three-dimensional macromolecular structures
    • Genome: complete genome assemblies
    • PopSet: population study data sets
    • OMIM: Online Mendelian Inheritance in Man
    • Taxonomy: organisms in GenBank
    • Books: online books
    • ProbeSet: gene expression and microarray datasets
    • 3D Domains: domains from Entrez Structure
    • UniSTS: markers and mapping data
    • SNP: single nucleotide polymorphisms
    • CDD: conserved domains
2. Entrez allows users to perform various searches.