Introduction to Microarray
Dr G. P. S. Raghava
Molecular Biology Overview
[Figure: a cell, its nucleus, a chromosome, a gene (DNA), the gene's single-stranded mRNA, and the protein]
Measuring Gene Expression
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the cell.
Measuring protein would be more direct, but is
currently harder.
(RT)
The Goals



Basic Understanding
– Arrays can take a snapshot of which subset of genes in a cell is actively
making proteins
– Heat shock experiments
Medical diagnosis
– Microarrays can indicate where mutations lie that might be linked to a
disease. Still others are used to determine if a person’s genetic profile
would make him or her more or less susceptible to drug side effects
– 1999: A GeneChip containing 6800 human genes was used to distinguish
between myeloid leukemia and lymphoblastic leukemia using a set of 50
genes that have different activity levels
Drug design
– Pharmaceutical firms are in a rush to translate the human genome
results into new products
• Potential profits are huge
• First, though, they must figure out what the genes do, how they
interact, and how they relate to diseases.
– Evaluation, Specificity, Response
Microarray Potential Applications

Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways

Recent examples
– molecular diagnosis of leukemia, breast
cancer, ...
– appropriate treatment for genetic signature
– potential new drug targets
History
1980s: antibody-based assay (protein chip?)
~1991: high-density DNA-synthetic chemistry
(Affymetrix/oligo chips)
~1995: microspotting (Stanford Univ/cDNA chips)
replacing porous surface with solid surface
replacing radioactive label with fluorescent label
improvement on sensitivity
What is a DNA Microarray?
genes or gene fragments
attached to a substrate (glass)
Tens of thousands of spots/genes
=entire genome in 1 experiment
A Revolution in Biology
Hybridized slide
Two dyes
Image analyzed
Gene Expression Microarrays
The main types of gene expression microarrays:
• Short oligonucleotide arrays (Affymetrix)
• cDNA or spotted arrays (Brown/Botstein)
• Long oligonucleotide arrays (Agilent Inkjet)
Terms/Jargon
Stanford/cDNA chip
– one slide/experiment
– one spot
– 1 gene => one spot or a few spots (replicates)
– control: control spots
– control: two fluorescent dyes (Cy3/Cy5)
Affymetrix/oligo chip
– one chip/experiment
– one probe/feature/cell
– 1 gene => many probes (20~25-mers)
– control: match and mismatch cells
Affymetrix Microarrays
Raw image: 1.28 cm chip, 50 µm features
~10^7 oligonucleotides,
half Perfectly Match mRNA (PM),
half have one Mismatch (MM)
Raw gene expression is intensity
difference: PM - MM
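As a rough illustration of the PM - MM idea, the sketch below averages the differences over the probe pairs of a single gene. The function name `average_difference` and the intensity values are invented for the example. This is only the plain difference described on the slide, not the vendor's actual algorithm.

```python
def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for four probe pairs of one gene.
pm = [1200.0, 950.0, 1430.0, 1010.0]
mm = [300.0, 410.0, 380.0, 290.0]
signal = average_difference(pm, mm)
print(signal)
```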
DNA Microarrays



Each probe consists of thousands of strands of identical
oligonucleotides
– The DNA sequences at each probe represent
important genes (or parts of genes)
Printing Systems
– Ex: HP, Corning Inc.
– Printing systems can build lengths of DNA up to 60
nucleotides long
– 1.28 x 1.28+ cm glass wafer
• Each “print head” is ~100 µm in diameter, and the heads are
separated by ~100 µm (5,000 to 20,000 probes)
Photolithographic Chips
– Ex: Affymetrix
– 1.28 x 1.28 cm glass/silicon wafer
• 24 x 24 µm probe site (500,000 probes)
– Lengths of DNA up to 25 nucleotides long
– Requires a new set of masks for each new array
type
GeneChip: The Process
Cells
-> Poly-A RNA (AAAA)
-> cDNA
-> IVT (in-vitro transcription) with 10% biotin-labeled uracil
-> Antisense cRNA
-> Fragment (heat, Mg2+)
-> Labeled fragments
-> Hybridize
-> Wash/stain
-> Scan
Hybridization and Staining
[Figure: biotin-labeled cRNA is hybridized to the GeneChip; the hybridized array is then stained with SAPE (streptavidin-phycoerythrin)]
Microarray Data

First, the Problems:
1. The fabrication process is not error free
2. Probes have a maximum length of 25-60 nucleotides
3. Biological processes such as hybridization are stochastic
4. Background light may skew the fluorescence
5. How do we decide if/how strongly a particular gene is being expressed?

Solutions to these problems are
still in their infancy
Affymetrix “Gene chip” system
Uses 25 base oligos synthesized in place on a
chip (20 pairs of oligos for each gene)
RNA labeled and scanned in a single “color”
– one sample per chip
Can have as many as 20,000 genes on a chip
Arrays get smaller every year (more genes)
Chips are expensive
Proprietary system: “black box” software, can
only use their chips
cDNA Microarray Technologies





Spot cloned cDNAs onto a glass microscope
slide
– usually PCR amplified segments of plasmids
Label 2 RNA samples with 2 different colors of
fluorescent dye - control vs. experimental
Mix two labeled RNAs and hybridize to the chip
Make two scans - one for each color
Combine the images to calculate ratios of
amounts of each RNA that bind to each spot
cDNA microarrays
Compare the genetic expression in two samples of cells
PRINT
cDNA from one
gene on each spot
SAMPLES
cDNA labelled red/green
e.g. treatment / control
normal / tumor tissue
HYBRIDIZE
Add equal amounts of
labelled cDNA samples to
microarray.
SCAN
Laser
Detector
“Long Oligos”
Like cDNAs, but instead of using a
cloned gene, design a 40-70 base probe
to represent each gene
• Relies on genome sequence database
and bioinformatics
• Reduces cross hybridization
• Cheaper and possibly more sensitive
than Affy. system

Images from scanner

Resolution
– standard 10 µm [currently, max 5 µm]
– 100 µm spot on chip = 10 pixels in diameter

Image format
– TIFF (tagged image file format) 16 bit (65,536 levels of grey)
– 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
– other formats exist, e.g. SCN (used at Stanford University)

Separate image for each fluorescent sample
– channel 1, channel 2, etc.
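The image size quoted above follows from simple arithmetic: a 1 cm edge scanned at 10 µm per pixel gives 1000 pixels, and each 16-bit pixel takes 2 bytes. A minimal sketch of that calculation (the function name `image_size_bytes` is chosen here for illustration):

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel=16):
    """Uncompressed size of a scan with square pixels of edge pixel_um."""
    pixels_x = round(width_cm * 10_000 / pixel_um)   # 1 cm = 10,000 µm
    pixels_y = round(height_cm * 10_000 / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8


size_10um = image_size_bytes(1, 1, 10)   # standard resolution
size_5um = image_size_bytes(1, 1, 5)     # finer resolution, four times the pixels
spot_diameter_pixels = round(100 / 10)   # 100 µm spot at 10 µm per pixel
print(size_10um, size_5um, spot_diameter_pixels)
```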
Processing of images

Addressing or gridding
– Assigning coordinates to each of the spots

Segmentation
– Classification of pixels either as foreground or as
background

Intensity determination for each spot
– Foreground fluorescence intensity pairs (R, G)
– Background intensities
– Quality measures
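To make the segmentation and intensity steps concrete, here is a minimal sketch that classifies the pixels of one spot patch by a fixed threshold and summarizes each class by its median. Fixed thresholding and the median summary are assumptions made for the example, since the slides do not specify a method. The patch values and the name `segment_spot` are invented.

```python
from statistics import median


def segment_spot(patch, threshold):
    """Split the pixels of a 2D patch into foreground and background lists."""
    pixels = [v for row in patch for v in row]
    foreground = [v for v in pixels if v >= threshold]
    background = [v for v in pixels if v < threshold]
    return foreground, background


# Hypothetical 4 x 4 patch around one spot (bright centre, dim surround).
patch = [
    [110, 120, 115, 105],
    [118, 900, 950, 112],
    [109, 920, 980, 117],
    [111, 114, 108, 116],
]
fg, bg = segment_spot(patch, threshold=500)
spot_fg = median(fg)   # foreground intensity for this channel
spot_bg = median(bg)   # local background intensity
print(spot_fg, spot_bg)
```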
Images in analysis software

The two 16-bit images (Cy3, Cy5) are compressed into 8-bit images

Display fluorescence intensities for both wavelengths
using a 24-bit RGB overlay image

RGB image :
– Blue values (B) are set to 0
– Red values (R) are used for Cy5 intensities
– Green values (G) are used for Cy3 intensities

Qualitative representation of results
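The overlay rule can be sketched per pixel as follows. The 16-bit to 8-bit step here simply drops the low byte, which is an assumption, since analysis packages may use a different compression. The function names are illustrative.

```python
def to_8bit(value16):
    """Scale a 16-bit intensity (0..65535) to 8 bits (0..255) by dropping the low byte."""
    return value16 >> 8


def overlay_pixel(cy5, cy3):
    """Return an (R, G, B) tuple: Cy5 drives red, Cy3 drives green, blue is 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)


print(overlay_pixel(65535, 0))      # strong Cy5 only: red
print(overlay_pixel(0, 65535))      # strong Cy3 only: green
print(overlay_pixel(65535, 65535))  # both strong: yellow
```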
Images : examples
Pseudo-colour overlay of the Cy3 and Cy5 images
Spot colour   Signal strength       Gene expression
yellow        Control = perturbed   unchanged
red           Control < perturbed   induced
green         Control > perturbed   repressed
Quantification of expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
(fg = foreground, bg = background) and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2(Red intensity / Green intensity)
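A minimal sketch of this calculation for one spot is shown below. Returning `None` when a background-corrected intensity is not positive is a choice made for the example, as the slides do not say how such spots are handled.

```python
import math


def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log2 ratio of background-corrected red and green intensities."""
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        return None  # assumption: the log ratio is undefined, leave the spot out
    return math.log2(red / green)


print(log_ratio(900, 100, 300, 100))  # red is 4x green
print(log_ratio(300, 100, 300, 100))  # equal signal
```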
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...
Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)
These values are conventionally displayed
on a red (>0) yellow (0) green (<0) scale.
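The display convention amounts to a sign test on the log ratio. A small sketch (the function name `display_colour` is illustrative; real displays use a continuous colour gradient rather than three discrete colours):

```python
def display_colour(log2_ratio):
    """Map a log2(R/G) value to the conventional display colour."""
    if log2_ratio > 0:
        return "red"
    if log2_ratio < 0:
        return "green"
    return "yellow"


for value in (0.46, -0.56, 0.0):
    print(value, display_colour(value))
```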
Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment
-> 16-bit TIFF files
-> Image analysis
-> (Rfg, Rbg), (Gfg, Gbg)
-> Normalization
-> R, G
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation
Quality control (-> Flag)

How good are foreground and background
measurements ?
– Variability measures in pixel values within each
spot mask
– Spot size
– Circularity measure
– Relative signal to background intensity
– Dapple:
• b-value : fraction of background intensities less than the
median foreground intensity
• p-score : extent to which the position of a spot deviates
from a rigid rectangular grid
Flag spots based on these criteria
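The b-value definition translates directly into code. The sketch below follows the wording of the definition only and is not taken from the Dapple software. The pixel values and the flagging cut-off of 0.9 are made up for the example.

```python
from statistics import median


def b_value(foreground, background):
    """Fraction of background intensities below the median foreground intensity."""
    med_fg = median(foreground)
    below = sum(1 for v in background if v < med_fg)
    return below / len(background)


# Hypothetical pixel intensities for one spot.
foreground = [900, 950, 920, 980]
background = [110, 120, 940, 105]
b = b_value(foreground, background)
flagged = b < 0.9  # hypothetical cut-off: flag spots with weak separation
print(b, flagged)
```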
Replication

Why?
• To reduce variability
• To increase generalizability

What is it?
• Duplicate spots
• Duplicate slides


Technical replicates
Biological replicates
Practical Application of DNA Microarrays

DNA Microarrays are used to study gene activity (expression)
– What proteins are being actively produced by a group of cells?
• “Which genes are being expressed?”

How?
– When a cell is making a protein, it transcribes the genes (made of
DNA) which code for the protein into RNA used in its production
– The RNA present in a cell can be extracted
– If a gene has been expressed in a cell
• RNA will bind to “a copy of itself” on the array
• RNA with no complementary site will wash off the array
– The RNA can be “tagged” with a fluorescent dye to determine its
presence

DNA microarrays provide a high-throughput technique for quantifying the
presence of specific RNA sequences
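Binding on the array relies on base pairing between the RNA and a complementary probe. The sketch below is an idealized model that assumes perfect complementarity, whereas real hybridization tolerates some mismatches and is stochastic. The sequences and function names are invented for illustration.

```python
# Base pairing between a DNA probe base and the RNA base it binds.
PAIRS = {"A": "U", "C": "G", "G": "C", "T": "A"}


def complementary_rna(probe_dna):
    """RNA sequence (5'->3') that pairs perfectly with a DNA probe (5'->3')."""
    return "".join(PAIRS[base] for base in reversed(probe_dna))


def binds(probe_dna, rna):
    """True if the RNA contains a stretch perfectly complementary to the probe."""
    return complementary_rna(probe_dna) in rna


probe = "ATGC"
print(complementary_rna(probe))
print(binds(probe, "AAGCAUCC"))  # contains the complementary stretch: stays bound
print(binds(probe, "AAAAAAAA"))  # no complementary site: washes off
```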
Analysis and Management of Microarray
Data

Magnitude of Data
– Experiments
• 50,000 genes in human
• 320 cell types
• 2000 compounds
• 3 time points
• 2 concentrations
• 2 replicates
– Data Volume
• 4*10^11 data points
• 10^15 = 1 petabyte of data
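The number of data points is the product of the experimental factors listed above, which comes to roughly 4 x 10^11. A quick check of that arithmetic (the dictionary keys are labels chosen for the example):

```python
factors = {
    "genes": 50_000,
    "cell types": 320,
    "compounds": 2_000,
    "time points": 3,
    "concentrations": 2,
    "replicates": 2,
}

total = 1
for count in factors.values():
    total *= count

print(f"{total:,} data points (about {total:.1e})")
```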
Thanks