Talk1.intro - Miami University

Introduction to Microarray
Gene Expression
Shyamal D. Peddada
Biostatistics Branch
National Inst. Environmental
Health Sciences (NIH)
Research Triangle Park, NC
Outline of the four talks
A general overview of microarray data
– Some important terminology and background
– Various platforms
– Sources of variation
– Normalization of data

Analysis of gene expression data - Nominal explanatory variables
– Two types of explanatory variables
– Scientific questions of interest
– A brief discussion on false discovery rate (FDR) analysis
– Some existing methods of analysis
Outline of the four talks
Analysis of ordered gene expression data
– Common experimental designs
– Some existing statistical methods
– An example
– Demonstration of ORIOGEN
– Some open research problems

Analysis of data from cell-cycle experiments
– Some background on cell-cycle experiments
– Modeling the data
– Data from multiple experiments
– Some open research problems
Talk 1: An overview
of microarray data
To perform statistical analysis
of any given data

It is important to understand all sources of (i)
bias, (ii) variability.
– Some basic understanding of the underlying technology!
– Understand the sampling/experimental design
Some Important Terminology
and Background
Central Dogma of Molecular Biology
Some background terminology:
DNA and RNA

DNA (Deoxyribonucleic acid) - Contains the genetic code or
instructions for the development and function of living
organisms. It is double stranded.

Four Nucleotides (building blocks of DNA)
– Adenine (A), Guanine (G),
– Thymine (T), Cytosine (C)

Base pairs: (A, T) (G, C)
E.g.
5’ ---AAATGCAT---3’
3’ ---TTTACGTA---5’
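The base-pairing rule can be checked on the example strands with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the talk.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement, read in the same left-to-right order."""
    return "".join(DNA_PAIR[base] for base in strand)


def reverse_complement(strand):
    """Return the complementary strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "AAATGCAT"                  # 5'->3' strand from the example
print(complement(top))            # bottom strand as drawn (3'->5')
print(reverse_complement(top))    # the same bottom strand, written 5'->3'
```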
Some background terminology:
DNA and RNA

RNA (Ribonucleic acid) - transcribed (or copied) from DNA.
It is single stranded. (Complementary copy of one of the
strands of DNA)

RNA polymerase - An enzyme that helps in the transcription
of DNA to form RNA.

Four Nucleotides (building blocks of RNA)
– Adenine (A), Guanine (G),
– Uracil (U), Cytosine (C)

Base pairs: (A, U) (G, C)
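Transcription can be sketched the same way: each base of the DNA template strand is paired with its RNA partner, with uracil taking the place of thymine. The function name and the choice of template strand are illustrative assumptions.

```python
# Pairing used when RNA is copied from a DNA template: A->U, T->A, G->C, C->G.
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(RNA_PAIR[base] for base in template_3_to_5)


# Template strand taken from the earlier DNA example (3' ---TTTACGTA--- 5').
print(transcribe("TTTACGTA"))
```

The RNA matches the other DNA strand (AAATGCAT) with every T replaced by U.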
Some background terminology:
Types of RNA
Types of RNA - (transfer) tRNA,
(ribosomal) rRNA, etc.


mRNA - messenger RNA. Carries information from
DNA to ribosomes where protein synthesis takes
place (less stable than DNA).
Some background terminology:
Oligos

Oligonucleotide - a short segment of DNA
consisting of a few base pairs. It is
commonly called an “Oligo” for short.

“mer” - unit of measurement for an Oligo. It is
the number of base pairs. So a 30 base pair Oligo
is a 30-mer.
Some background terminology:
Probes

cDNA - complementary DNA. DNA sequence that is
complementary to the given mRNA.
– Obtained using an enzyme called reverse transcriptase.

Probes - a short segment of DNA (about 100-mer
or longer) used to detect DNA or RNA that
complements the sequence present in the probe.
Some background terminology:
“Blots” - Origins of Microarrays

Southern blot (Edwin Southern, 1975 J. Molec.
Biol.)
– A method used to identify the presence of a DNA
sequence in a sample of DNA.

Western blot (immunoblot)
– to identify a specific protein from a tissue extract.
Some background terminology

Southwestern blot
– to identify and characterize DNA-binding proteins.

Northern blot
– A method used to study the gene expression from a
sample of mRNA.
Microarrays …
Northern blot Vs Microarray
                              Microarray                                Northern blot
Rate of expression analysis   Thousands of genes at a time              Few genes at a time
                              (High throughput)
Automation                    Automation possible                       Manual
Scope                         Allows to explore relationships among     Limited
                              several 100’s of genes at the same time
What is a Microarray?

Sequences from thousands of different genes are
immobilized, or attached, at fixed locations on a solid support.

The sequences are either spotted or synthesized directly onto the
support.
Microarray Technology

Two color dye array (Spotted array)
– Spotted cDNA microarrays
– Spotted oligo microarrays

Single dye array
– In situ oligo microarrays
Microarray Technology
Spotted Microarrays
Spotted DNA Microarray

Spotted DNA array is typically “home made” so
you need to think about:
– cDNA or Oligo
– Location of the Oligo in a given gene
– Oligo length - number of bp?
Spotted DNA Microarray

Gene expression:
$Y = \log_2\left(\frac{\text{Red}}{\text{Green}}\right)$

– Y < 0; gene is over expressed in green labeled sample
compared to red-labeled sample
– Y = 0; gene is equally expressed in both samples
– Y > 0; gene is over expressed in red-labeled sample
compared to green labeled sample
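As a small illustration, the log ratio and its interpretation can be computed as follows. The intensity values are made up for the example and the function names are hypothetical.

```python
import math


def log_ratio(red, green):
    """Y = log2(Red / Green) for one spot on a two-color array."""
    return math.log2(red / green)


def interpret(y, tol=1e-9):
    """Translate the sign of Y into the wording used on the slide."""
    if y > tol:
        return "over expressed in red-labeled sample"
    if y < -tol:
        return "over expressed in green-labeled sample"
    return "equally expressed in both samples"


# Illustrative (made-up) spot intensities: (red, green).
for red, green in [(800.0, 200.0), (300.0, 300.0), (100.0, 400.0)]:
    y = log_ratio(red, green)
    print(red, green, y, interpret(y))
```

Working on the log2 scale makes a two-fold change in either direction symmetric: Y = +1 and Y = -1.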
Single Dye Microarrays
Major Commercial Platforms


More than 50 companies are currently offering various DNA
microarray platforms, reagents and software
Affymetrix dominated the market for many years
Manufacturer        Code  Protocol              Platform                                      # of Probes
Applied Biosystems  ABI   One-color microarray  Human Genome Survey Microarray v2.0           32878
Affymetrix          AFX   One-color microarray  HG-U133 Plus 2.0 GeneChip                     54675
Agilent*            AG1   One-color microarray  Whole Human Genome Oligo Microarray, G4112A   43931
Eppendorf           EPP   One-color microarray  DualChip Microarray                           294
GE Healthcare       GEH   One-color microarray  CodeLink Human Whole Genome, 300026           54359
Illumina            ILM   One-color microarray  Human-6 BeadChip, 48K v1.0                    47293
*Agilent has both one-color and two-color microarray platforms
Affymetrix GeneChip

Each gene is represented by 11 to 20 oligos of 25-mers

Probe: An oligo of 25-mer

Probe Pair: a PM and MM pair

Perfect match (PM): A 25-mer complementary to a reference
sequence of interest (part of the gene)

Mismatch (MM): same as PM with a single base change for the
middle (13th) base (G <-> C, A <-> T)

Probe set: a collection of probe-pairs (11 to 20) related to a
fraction of gene
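The PM/MM construction can be sketched in a few lines: the mismatch probe is the perfect-match 25-mer with its 13th base swapped (G <-> C, A <-> T). The probe sequence below is made up for illustration and the function name is hypothetical.

```python
# Swap applied to the middle base of a perfect-match probe: G <-> C, A <-> T.
MIDDLE_SWAP = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm):
    """Return the MM probe for a 25-mer PM probe (13th base changed)."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    mid = 12  # 13th base, using 0-based indexing
    return pm[:mid] + MIDDLE_SWAP[pm[mid]] + pm[mid + 1:]


pm = "ACGTACGTACGTACGTACGTACGTA"   # made-up 25-mer, not a real probe
mm = mismatch_probe(pm)
print(pm)
print(mm)
```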
Affymetrix call for the presence of a signal

The Affymetrix detection algorithm uses probe pair
intensities to obtain a detection p-value.
Using this p-value, the signal is called
– “present”, “marginal” or “absent”
Affy call

Detection p-value
– Calculate the discrimination score R for each probe pair

R = (PM - MM) / (PM + MM)
– Test whether the scores exceed a small threshold tau (one-sided
Wilcoxon signed-rank test) to obtain the detection p-value for the gene.
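A rough sketch of how a detection call could be computed from the probe-pair scores (PM - MM) / (PM + MM) is given below. Several pieces are assumptions and do not come from the slides: the one-sided Wilcoxon signed-rank test, the threshold 0.015 and the p-value cutoffs 0.04 and 0.06 are quoted from memory of the vendor's documented defaults and should be checked against the technical manual. The exact p-value is found by enumerating sign patterns, which is practical only for the small probe sets (11 to 20 pairs) considered here.

```python
from itertools import product

TAU = 0.015                    # assumed threshold on the score (check the manual)
ALPHA1, ALPHA2 = 0.04, 0.06    # assumed cutoffs for present / marginal / absent


def discrimination_scores(pm, mm):
    """Score (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def signed_rank_pvalue(diffs):
    """Exact one-sided signed-rank p-value for H1: differences tend to be > 0."""
    diffs = [d for d in diffs if d != 0]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) >= observed
    )
    return hits / 2 ** len(ranks)


def detection_call(pm, mm):
    """Return (p-value, call) for one probe set."""
    scores = discrimination_scores(pm, mm)
    p = signed_rank_pvalue([s - TAU for s in scores])
    if p < ALPHA1:
        return p, "present"
    if p < ALPHA2:
        return p, "marginal"
    return p, "absent"
```

For example, `detection_call(pm, mm)` on a probe set whose PM intensities are all far above the MM intensities returns a very small p-value and the call "present".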
Affy call
Ref: Affymetrix Technical Manual
Affymetrix Vs Illumina
Ref: Pan Du & Simon Lin
Microarray Data Analysis
Why Normalize Data?

To “calibrate”/adjust data so as to reduce or
eliminate the effects arising from variation in
technology and other sources, so that the remaining differences
reflect true biological differences between test groups.
Sources of bias/variation

Tissue or cell lines

mRNA
– It can degrade over time - so there is a potential batch
effect if portions of experiment are performed at
different times
– Purity and quantity



Dye color effect (spotted arrays)
Variation due to technology - is substantially reduced with
improved technology
Etc.
A useful graphical representation of data


Data matrix:

Let $X$ be an $m \times n$ matrix with $\mathrm{Rank}(X) = r \le \min(m, n) \le n$, where
$m$ = # genes, $n$ = # samples.
$S$: $m \times m$ sample covariance matrix.

A useful graphical representation of data

Let its spectral decomposition be given by
$S = \Gamma \Lambda \Gamma'$
where
$\Gamma$: $m \times r$ matrix of eigenvectors
$\Lambda$: $r \times r$ diagonal matrix of non-zero eigenvalues
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$.

A useful graphical representation of data

Then
$Z = \Gamma' X$: $r \times n$ matrix of "eigengenes"
$Z_i = \gamma_i' X$: $i$th eigengene, where $\gamma_i$ is the $i$th column of $\Gamma$.

Plot $Z_1$ vs $Z_2$
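The eigengene construction can be sketched with NumPy (assumed to be available). The data below are simulated, and the function name and the tolerance used to decide which eigenvalues count as non-zero are illustrative choices. For real arrays with thousands of genes, forming the m x m covariance matrix directly is wasteful and a singular value decomposition would normally be used instead.

```python
import numpy as np


def eigengenes(X, tol=1e-10):
    """Project X (genes x samples) onto eigenvectors of the gene covariance.

    Returns (Z, eigenvalues): Z has one row per non-zero eigenvalue and one
    column per sample; eigenvalues are sorted in decreasing order.
    """
    S = np.cov(X)                          # m x m covariance, rows = genes
    evals, evecs = np.linalg.eigh(S)       # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > tol * evals[0]          # drop numerically zero eigenvalues
    Z = evecs[:, keep].T @ X               # rows of Z are the eigengenes
    return Z, evals[keep]


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))               # simulated: 50 genes, 6 samples
Z, evals = eigengenes(X)
print(Z.shape)                             # (number of eigengenes, samples)
# The scatter of Z[0] against Z[1] is the plot of the first two eigengenes.
```

Each column of Z corresponds to one sample, so the plot shows samples as points; arrays that separate by batch rather than by treatment in this plot point to a technical effect.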
Common Normalization Methods

Internal Control Normalization

Global Normalization

Linear Normalization (Spotted arrays)

Non-linear Normalization Method (Spotted arrays) LOWESS curve.

ANOVA

COMBAT (for batch effect)
Internal control normalization
(Housekeeping gene(s))
Expression of each gene is measured relative to
the average of housekeeping genes.
– Basic assumption: Expression of housekeeping genes
does not change.
Disadvantage:
– Housekeeping genes may be highly expressed sometimes.
Unexpected regulation of housekeeping gene(s) leads to
misinterpretation.
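A minimal sketch of internal control normalization follows. It assumes intensities are compared on the log2 scale and that the housekeeping average is taken on that scale; both are choices made for the example, not stated in the talk. The gene names and intensities are illustrative.

```python
import math


def normalize_to_housekeeping(intensities, housekeeping):
    """Express each gene's log2 intensity relative to the housekeeping average.

    intensities: dict mapping gene name -> raw intensity on one array
    housekeeping: names of genes assumed not to change between conditions
    """
    reference = sum(math.log2(intensities[g]) for g in housekeeping) / len(housekeeping)
    return {gene: math.log2(value) - reference for gene, value in intensities.items()}


# Made-up intensities; "geneX" is a hypothetical gene of interest.
array = {"GAPDH": 1024.0, "ACTB": 256.0, "geneX": 2048.0}
print(normalize_to_housekeeping(array, ["GAPDH", "ACTB"]))
```

If a housekeeping gene is itself regulated by the treatment, the reference shifts and every normalized value shifts with it, which is the disadvantage noted above.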
Global Normalization
Basic assumption
– Mean/Median expression ratio of all monitored
mRNAs is constant across a chip.
Regression of $\log(R/G)$ on a constant.
In simple terms the log ratios are corrected by a
common “mean” or “median”

This method can also be applied to single-dye data
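In code, global normalization of a two-color array amounts to subtracting a common center from all log ratios. The sketch below uses the median and log base 2; the function name and the intensities are illustrative.

```python
import math
import statistics


def global_normalize(red, green):
    """Center the log2(R/G) ratios of one array at their median."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    center = statistics.median(log_ratios)
    return [value - center for value in log_ratios]


# Made-up intensities for three spots on one array.
print(global_normalize([200.0, 400.0, 800.0], [100.0, 100.0, 100.0]))
```

Using the mean instead of the median gives the least-squares fit of the log ratios on a constant; the median is less sensitive to a few strongly regulated genes.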
Linear Normalization
(for spotted arrays)
Basic assumption
– Mean/Median expression ratio of all monitored
mRNAs depends upon the average intensity
Regression of $\log(R/G)$ on $\frac{1}{2}\log(RG)$
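A sketch of linear normalization: regress the log ratio on the average log intensity by ordinary least squares and keep the residuals as the normalized log ratios. Log base 2 and the function name are assumptions made for the example.

```python
import math


def linear_normalize(red, green):
    """Residuals from the least-squares line of log2(R/G) on (1/2) log2(RG)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]          # log ratio
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]    # average log intensity
    n = len(m)
    a_bar = sum(a) / n
    m_bar = sum(m) / n
    sxx = sum((x - a_bar) ** 2 for x in a)
    sxy = sum((x - a_bar) * (y - m_bar) for x, y in zip(a, m))
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    return [y - (intercept + slope * x) for x, y in zip(a, m)]


# Made-up intensities for four spots.
red = [120.0, 500.0, 2100.0, 9000.0]
green = [100.0, 400.0, 1600.0, 6400.0]
print(linear_normalize(red, green))
```

The residuals sum to zero by construction, so this includes the global centering step as a special case (slope fixed at zero).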
Non-Linear Normalization
(for spotted arrays)
Basic assumption
– Mean/Median expression ratio of all monitored mRNAs
depends upon the average intensity
Regression of $\log(R/G)$ on $C(\log(RG))$,
where $C(\log(RG))$ is estimated by the robust scatter plot
smoother LOWESS (Locally WEighted Scatterplot Smoothing)
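The idea can be sketched with a simplified local linear smoother using tricube weights. This is not full LOWESS: the robustness iterations that down-weight outlying points are omitted, and the span `frac` is an arbitrary choice. NumPy is assumed to be available, and the x-values are assumed to contain enough distinct points in each neighbourhood for the local fit to be solvable.

```python
import numpy as np


def local_linear_smooth(x, y, frac=0.4):
    """Fitted values from tricube-weighted local linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))        # number of neighbours in each fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]              # distance to the k-th nearest point
        w = np.clip(1 - (dist / h) ** 3, 0, None) ** 3    # tricube weights
        design = np.column_stack([np.ones(n), x - x[i]])
        weighted = design * w[:, None]
        beta = np.linalg.solve(design.T @ weighted, weighted.T @ y)
        fitted[i] = beta[0]                   # local line evaluated at x[i]
    return fitted


def nonlinear_normalize(red, green, frac=0.4):
    """Subtract the smoothed trend of log2(R/G) in log2(RG) from the log ratios."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    m = np.log2(red / green)
    a = np.log2(red * green)
    return m - local_linear_smooth(a, m, frac)
```

In practice a library implementation of LOWESS with robustness iterations would be preferable to this sketch.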


Analysis of Variance (ANOVA)
Standard Analysis of Variance model
– Response variable - Gene expression
– Explanatory variables:
  – Dye color
  – Batch
  – Other potential effects?
Advantage: Statistically significant
genes can be identified while controlling for the
various experimental conditions/factors.
Some important experimental designs

Pooled Samples versus Separate samples
– Sometimes there may not be sufficient
biological sample/specimen from a given animal.
In such cases biological samples are pooled
from several identical animals to form a sample.
An example of a pooling design
(for each treatment group)
[Figure: Subjects -> Pool -> Observations (Microarray chips)]
The pooling design
                 Subjects   Pool                  Observations (Microarray chips)
Example          9          3 (3 per pool)        6
More generally   n          p (r=n/p per pool)    m

The standard design
                 Subjects   # Pool                Observations (Microarray chips)
Example          9          9 (r=1)               9
More generally   n          p=n (r=1)             m=n
Some issues
• What are the underlying parameters?
• Effect of pooling on power.
• The basic assumption. Validity of the
assumption.
Parameters
• Total variation in the expression of a gene can be
decomposed into:
– Biological variation
– Technical variation
• Biological samples (n)
• Number of pools (p)
• Biological samples per pool (r=n/p)
• Observed number of samples (e.g. microarrays) (m)
Some comments about pooling
Variance of the estimated mean expression of a gene depends on:
– number of pools (p)
– number of bio samples per pool (r)
– number of arrays (m)
– biological variation
– technical variation
Pooling works well when the biological variation in the gene
expression is substantially larger than the technical variation.
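One way to see this dependence is through a simple variance formula. The formula below is an assumption added for illustration, not taken from the slides: it supposes that a pool's expression is the average of its r subjects, that pools are of equal size, and that the m arrays are spread evenly over the p pools, which gives Var = (biological variance) / (p r) + (technical variance) / m.

```python
def variance_of_mean(sigma2_bio, sigma2_tech, p, r, m):
    """Variance of the estimated group mean under the assumed pooling model.

    sigma2_bio: biological variance between subjects
    sigma2_tech: technical variance between arrays
    p: number of pools, r: subjects per pool, m: number of arrays
    """
    return sigma2_bio / (p * r) + sigma2_tech / m


# Biological variance 4 times the technical variance (illustrative units).
print(variance_of_mean(4.0, 1.0, p=6, r=1, m=6))    # standard design, 6 subjects
print(variance_of_mean(4.0, 1.0, p=3, r=2, m=3))    # 6 subjects in 3 pools, 3 arrays
print(variance_of_mean(4.0, 1.0, p=5, r=2, m=5))    # 10 subjects in 5 pools, 5 arrays
```

Under this model, pooling reduces the number of arrays without changing the biological term, so it costs little when technical variance is small relative to biological variance. Variance alone does not determine power, since designs with fewer arrays also leave fewer degrees of freedom for estimating the error.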
Power comparisons (Zhang and Gant, 2005)
# Bio      # Micro   Pool size                 Power
5/group    5/group   1 (Standard design)       0.81
6/group    6/group   1 (Standard design)       0.95
6/group    3/group   2 (i.e. 3 pools/group)    0.30
8/group    4/group   2 (i.e. 4 pools/group)    0.80
10/group   5/group   2 (i.e. 5 pools/group)    0.98
Power comparisons
Conditions of the simulation study:
Biological variation is 4 times the technical variation.
False positive rate is 0.001.
Detect 2-fold expression.
Data are normally distributed.
A fundamental assumption
Biological averaging:
Suppose an experiment consists of pooling “r” samples. Then
the expression of a gene in the pooled sample is assumed to
be the average of the gene’s expression in the “r” samples.
This assumption need not be true especially if the expression
values are transformed non-linearly.
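A short numerical illustration of why the assumption can fail after a non-linear transformation: if pooling averages expression on the raw scale, the log of the pooled value is not the average of the individual log values. The intensities are made up for the example.

```python
import math

# Made-up raw expression values for r = 3 subjects in one pool.
individual = [100.0, 400.0, 1600.0]

pooled_raw = sum(individual) / len(individual)       # averaging on the raw scale
log_of_pool = math.log2(pooled_raw)                  # what the array of the pool measures
mean_of_logs = sum(math.log2(x) for x in individual) / len(individual)

print(log_of_pool, mean_of_logs)
```

The log of the pooled value is never smaller than the mean of the individual logs (Jensen's inequality), and the gap grows as the subjects become more variable.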
Some important experimental designs

Reference designs (Spotted array)
– Each treatment sample is hybridized against a common
reference control.

Loop designs (Spotted array)
– Suppose we have a control and three experimental groups
A, B and C. Then hybridize Control with A, A with B, B
with C, and C with Control, closing the loop.
Data Analysis - Preliminaries

Normalization

Transformation of data (usual methods)
– Perhaps first fit ANOVA and plot the residuals
– Log transformation
– Square root
– More generally, Box-Cox family of transformations

Identify potential outliers in the data (again, perhaps use the
residuals)
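The Box-Cox family can be written as a single function of a parameter lambda, with the log transformation as the limiting case lambda = 0 and the square root corresponding to lambda = 0.5 up to a shift and rescaling. A minimal sketch, with an illustrative function name:

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)             # limiting case as lam -> 0
    return (x ** lam - 1) / lam


for lam in (0, 0.5, 1):
    print(lam, [box_cox(v, lam) for v in (1.0, 4.0, 9.0)])
```

In practice lambda is chosen from the data, for example by checking which value makes the ANOVA residuals look most symmetric with roughly constant spread.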
Data Analysis

Method of Analysis depends upon the scientific
question of interest.

In the next three lectures we describe several
general methods and illustrate some using real
data!