Probe design for microarrays using OligoWiz

The DNA Array Analysis Pipeline
Question
Experimental Design
Array design
Probe design
Sample Preparation
Hybridization
Buy Chip/Array
Image analysis
Normalization
Expression Index Calculation
Comparable Gene Expression Data
Statistical Analysis
Fit to Model (time series)
Advanced Data Analysis
Clustering
Meta analysis
PCA
Classification
Survival analysis
Promoter Analysis
Regulatory Network
Probe design
for microarrays
-What is a Probe
-Different Probe Types
-OligoWiz
-Probe Design
-Cross Hybridization and Complexity
-Affinity
-Position
An Ideal Probe must
- Discriminate well between its intended target and all other targets in the target pool
- Detect concentration differences under the hybridization conditions
Probe Type comparisons
PCR products
- Advantages: Inexpensive to set up
- Disadvantages: Handling problems; no probe selection; uneven probe concentrations
Spotted Oligos
- Advantages: Allows for probe selection; easy to handle
- Disadvantages: Expensive in small scale
In situ synthesized oligonucleotide arrays
- Advantages: Allows for probe selection; fast to set up; multiple probes per gene
- Disadvantages: Expensive in large scale
Custom Microarrays
When on virgin ground
Some technologies available for custom arrays
Spotted arrays
in situ synthesized
NimbleExpress™ Array Program
OligoWiz a Tool
for flexible probe design
How does it work?
Probe selection
1. The optimal melting temperature (Tm) for the DNA:DNA or RNA:DNA hybridization is determined for probes of the given length.
2. The optimal probe length is determined for all possible probes along the input sequence.
3. Five scores are calculated for each of these probes.
4. The best probes are selected based on a weighted sum of these scores.
The five scores
In order of importance
Cross-hybridization
∆Tm - (deviation from optimal Tm)
Folding - (probe self annealing)
Position - (3’ preference)
Low-complexity
All scores are normalized to a value between 0.0 (bad) and 1.0 (best).
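The selection by weighted sum can be sketched as follows. This is a minimal illustration, not OligoWiz's actual code: the weight values, the score names and the `best_probes` helper are assumptions chosen only to respect the stated order of importance.

```python
# Hypothetical weights; only their ordering (most to least important)
# follows the slide. The real OligoWiz weights are not given here.
WEIGHTS = {
    "cross_hyb": 0.35,
    "delta_tm": 0.25,
    "folding": 0.20,
    "position": 0.12,
    "low_complexity": 0.08,
}


def total_score(scores):
    """Weighted sum of the five normalized scores (each in 0.0..1.0)."""
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / sum(WEIGHTS.values())


def best_probes(candidates, k=1):
    """Return the k candidates with the highest total score."""
    ranked = sorted(candidates, key=lambda c: total_score(c["scores"]), reverse=True)
    return ranked[:k]
```

Each candidate is assumed to be a dictionary with a `"scores"` entry holding the five normalized values, so a probe with all scores at 1.0 gets a total of 1.0.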
How to Avoid
cross-hybridization
From Kane et al. (2000) we learn that a 50’mer probe can detect significant false signal from a target that has
- >75-80% homology to the 50’mer oligo, or
- a continuous stretch of >15 complementary bases.
If we have substantial sequence information on the given organism, we can try to avoid this by choosing oligos that are not similar to any other expressed sequences.
Probe Specificity
Hughes et al. 2001
Mapping Regions
without similarity to other transcripts
[Figure: the sequence we want to design a probe for, drawn 5’ to 3’ with a 50 bp marker. Labels: BLAST hits >75% and longer than 15 bp; regions suitable for probes.]
Filtering Self Detecting
BLAST hits out
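[Figure: the sequence we want to design an oligo for, drawn 3’ to 5’ with a 50 bp marker. Labels: BLAST hits >75% and longer than 15 bp; sequence identical or very similar to the query sequence.]
A hit that is identical or very similar to the query sequence is the query detecting itself. Therefore BLAST hits with homology > 97% and with a ‘hit length vs. query length’ ratio > 0.8 are not considered.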
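The two slides above (mapping hit-free regions, then filtering self-detecting hits) can be sketched together. This is a minimal illustration under assumptions: the `Hit` record, its coordinate convention and the function names are hypothetical, and only the thresholds (75%, 15 bp, 97%, 0.8) come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    start: int        # 0-based start on the query (inclusive); assumed convention
    end: int          # end on the query (exclusive)
    identity: float   # percent identity of the BLAST hit


def is_self_hit(hit, query_len):
    """Hit that is the query detecting itself: >97% and length ratio >0.8."""
    return hit.identity > 97.0 and (hit.end - hit.start) / query_len > 0.8


def is_cross_hyb_hit(hit):
    """Hit strong enough to matter: >75% identity and longer than 15 bp."""
    return hit.identity > 75.0 and (hit.end - hit.start) > 15


def free_probe_starts(query_len, hits, probe_len=50):
    """Start positions of probes that overlap no considered BLAST hit."""
    covered = [False] * query_len
    for hit in hits:
        if is_self_hit(hit, query_len) or not is_cross_hyb_hit(hit):
            continue
        for pos in range(hit.start, hit.end):
            covered[pos] = True
    starts, run = [], 0
    for pos in range(query_len):
        run = 0 if covered[pos] else run + 1
        if run >= probe_len:
            starts.append(pos - probe_len + 1)
    return starts
```

Note that OligoWiz itself turns the hits into a score rather than a hard mask; the mask here only illustrates the "regions suitable for probes" picture.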
Cross-hybridization
expressed as a score
Only BLAST hits that passed filtering are considered
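Let m be the number of BLAST hits considered at position i, and let h = (h_{1i}, ..., h_{mi}) be the BLAST hits (in percent, 0 to 100) at position i in the oligo. Then

\[
\text{BLAST max score} = \frac{100 \cdot n - \sum_{i=1}^{n} \max(h_{1i}, \ldots, h_{mi})}{100 \cdot n}
\]

Where n is the length of the oligo.
[Figure: BLAST hits stacked under the oligo; at each position i the maximum hit is read off on a scale from 0 to 100%.]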
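The score can be computed directly from the per-position hit values. A minimal sketch, assuming the hits have already been filtered and are given as one list of percentages per oligo position (the input layout and the function name are assumptions):

```python
def cross_hyb_score(hits_per_position):
    """BLAST max score: 1.0 means no hits, 0.0 means a 100% hit at every position.

    hits_per_position[i] holds the percent values (0..100) of all
    considered BLAST hits covering position i; an empty list means no hit.
    """
    n = len(hits_per_position)
    if n == 0:
        raise ValueError("oligo must have at least one position")
    max_sum = sum(max(hits, default=0.0) for hits in hits_per_position)
    return (100.0 * n - max_sum) / (100.0 * n)
```

For a 4-base toy oligo with hits `[[], [50], [100, 20], []]` the per-position maxima sum to 150, giving (400 - 150) / 400 = 0.625.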
Similar Affinity
for all oligos
Another way of ensuring an optimal discrimination between target and non-target under hybridization is to design all the oligos on an array with similar affinity for their targets.
This will allow the experimentalist to optimize the hybridization conditions for all oligos by choosing the right hybridization temperature and salt concentration.
Melting temperature (Tm) is commonly used as a measure of DNA:DNA or RNA:DNA hybrid affinity.
Melting Temperature
difference
Tm(i ) 
1000DH
Ct
A  DS  R ln( )
4
 273.15  16.6 log[ Na ]
Where DH (Kcal/mol) is the sum of the nearest neighbor
enthalpy, A is a constant for helix initiation corrections, DS is
the sum of the nearest neighbor entropy changes, R is the
Gas Constant (1.987 cal deg-1 mol-1) and Ct is the total
molar concentration of strands.

DTm score  Tm(i) -
1
N
Tm(i)
N
Where N is all oligos in all sequences.
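The two formulas translate into a few lines of code. This is a sketch under assumptions: the nearest neighbor sums ΔH and ΔS are taken as inputs rather than looked up from a parameter table, the default A = -10.8 is a commonly quoted helix initiation value and not necessarily the one OligoWiz uses, and the deviation is taken as an absolute value.

```python
import math

GAS_CONSTANT = 1.987  # cal K^-1 mol^-1


def melting_temperature(dh_kcal, ds_cal, ct, na, a=-10.8):
    """Tm in degrees Celsius from nearest neighbor sums.

    dh_kcal: summed enthalpy in kcal/mol; ds_cal: summed entropy in cal/(K mol);
    ct: total molar strand concentration; na: molar sodium concentration;
    a: helix initiation constant (assumed default, see lead-in).
    """
    denominator = a + ds_cal + GAS_CONSTANT * math.log(ct / 4.0)
    return 1000.0 * dh_kcal / denominator - 273.15 + 16.6 * math.log10(na)


def delta_tm(tms):
    """Absolute deviation of each oligo's Tm from the mean over all oligos."""
    mean_tm = sum(tms) / len(tms)
    return [abs(tm - mean_tm) for tm in tms]
```

With ΔH = -100 kcal/mol, ΔS = -270 cal/(K mol), Ct = 1e-6 M and 0.1 M Na+ this gives a Tm of roughly 31.8 °C; raising the salt concentration raises the Tm through the last term.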
Tm distributions
for 30’mers and 50’mers
DTm Distribution
for probe length intervals
Avoid self annealing oligos
Sensitivity may be influenced
Probes that form strong hybrids with themselves, i.e. probes that fold, should be avoided.
But accurate folding algorithms, like the ones employed by mFOLD or RNAfold, are too time consuming for large scale folding of oligos.
Time consumption:
mFOLD ~2 sec / 30’mer
Per gene (500 bp, about 470 candidate 30’mers) ~16 min.
Folding an oligonucleotide
an approximation
Dynamic programming: alignment to inverted self
- The alignment is based on dinucleotides
- Substitution matrix is based on binding energies
[Figure: dynamic programming matrix labelled with the oligo's dinucleotides (AT TG CT ... CG GT TT) and a "minimal loop size border".]
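The idea of aligning an oligo to its inverted self can be illustrated with a much simpler stand-in. The sketch below is not the OligoWiz algorithm: it scores single Watson-Crick base pairs instead of dinucleotide binding energies, allows no gaps, and simply reports the longest uninterrupted stem; `min_loop` plays the role of the minimal loop size border.

```python
# Watson-Crick partners for DNA; G-T wobble pairs are ignored in this sketch.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def longest_stem(seq, min_loop=3):
    """Length of the longest ungapped hairpin stem in seq.

    stem[i][j] is the number of consecutive base pairs (i, j), (i+1, j-1), ...
    stacked inward from positions i and j. Pairs closer than min_loop
    unpaired bases are never counted (the minimal loop size border).
    """
    n = len(seq)
    stem = [[0] * n for _ in range(n)]
    best = 0
    # Fill by increasing distance so the inner cell is always ready.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            if PAIRS.get(seq[i]) == seq[j]:
                stem[i][j] = stem[i + 1][j - 1] + 1
                best = max(best, stem[i][j])
    return best
```

For `GGGGAAAACCCC` the four G's pair with the four C's around an AAAA loop, so the function returns 4, while a homopolymer such as `AAAAAAAA` returns 0.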
Folding a lot of oligos
a fast heuristic implementation
- Full dynamic programming calculation for the first probe
- Dynamic programming calculation for the second probe, etc., up to the last probe
- Super-alignment matrix
[Figure: super-alignment matrix labelled with the dinucleotides of the sequence (AT TG CT ... CG GT TT), the first probe, the second probe, the last probe and the "minimal loop size border".]
Reasonable folding prediction compared to mFOLD
Probes with very common subsequences may result in nonspecific signal
If the subsequences of an oligo are very common, we define the oligo as ‘low-complex’
Oligo with low-complexity:
AAAAAAAGGAGTTTTTTTTCAAAAAACTTTTTAAAAAAGCTTTAGGTTTTTA
(Human)
Oligo without low-complexity:
CGTGACTGACAGCTGACTGCTAGCCATGCAACGTCATAGTACGATGACT
(Human)
Low-complexity
expressed as a score
For a given transcriptome a list of information content from all ‘words’ with length wl (8 bp) is calculated:

\[
I(w) = \frac{f(w)}{tf(w)} \log_2\left( \frac{f(w)}{tf(w)} \cdot 4^{wl} \right)
\]

Where f(w) is the number of occurrences of a pattern and tf(w) is the total number of patterns of length wl.
A low-complexity score for a given oligo is defined as:

\[
\text{Low-complexity score} = 1 - \mathrm{norm}\left( \sum_{i=1}^{L - wl + 1} I(w_i) \right)
\]

Where norm is a function that normalizes to between 0 and 1, L is the length of the oligo and w_i is the pattern in position i.
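A small sketch of the word table and the sum inside the score. The function names are assumptions, and the `norm` function is left out because the slides do not specify it; words that never occur in the transcriptome are given an information content of 0.

```python
import math
from collections import Counter


def word_information(transcripts, wl=8):
    """Information content I(w) for every word of length wl in the transcriptome."""
    counts = Counter()
    for seq in transcripts:
        for i in range(len(seq) - wl + 1):
            counts[seq[i:i + wl]] += 1
    total = sum(counts.values())  # tf(w): total number of words of length wl
    info = {}
    for word, f in counts.items():
        p = f / total
        info[word] = p * math.log2(p * 4 ** wl)
    return info


def complexity_sum(oligo, info, wl=8):
    """Sum of I(w_i) over the L - wl + 1 words of the oligo (before normalization)."""
    return sum(info.get(oligo[i:i + wl], 0.0) for i in range(len(oligo) - wl + 1))
```

A large sum means the oligo is built from words that are common in the transcriptome, which after normalization gives a score close to 0.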
Location of Oligo
within transcript
Labeling includes reverse transcription of the mRNA and is sensitive to:
- RNA degradation
- Premature termination of cDNA synthesis
- Premature termination of cRNA transcription (IVT)
Eukaryote Position Score:
3’ preference
Prokaryote Position score
Preference toward 3’, but avoid ~50 most 3’ bases
Typically eukaryote sample labeling is done by poly-T priming and bacterial sample labeling by random priming
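The two position preferences can be pictured with a toy scoring function. The linear shape below is purely hypothetical and is not taken from OligoWiz; only the 3’ preference and the ~50 base exclusion zone for prokaryotes come from the slide.

```python
def eukaryote_position_score(probe_end, transcript_len):
    """Hypothetical linear 3' preference: 1.0 when the probe ends at the 3' end."""
    return probe_end / transcript_len


def prokaryote_position_score(probe_end, transcript_len, avoid=50):
    """Hypothetical 3' preference that rejects the `avoid` most 3' bases."""
    if transcript_len <= avoid or transcript_len - probe_end < avoid:
        return 0.0
    return probe_end / (transcript_len - avoid)
```

Here `probe_end` is the position of the probe's last base counted from the 5’ end of the transcript, so larger values are closer to the 3’ end.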
Species databases
Databases for 398 species are currently available
The species databases are built from complete genomic sequences, or from UniGene collections in the case of vertebrates.
The databases are used for:
•Cross hybridization
•Low-complexity
Sequence Features
Intron/Exon structure, UTR regions etc.
-Special purpose arrays
-Example: Detecting Differential splicing
[Figure: transcript structure with exons and an intron.]
Annotation String
- single letter code
Sequence:   ATGTCTACATATGAAGGTATGTAA
Annotation: (EEEEEEEEEEEEEE)DIIIIIII
E: Exon
I: Intron
(: Start of exon
): End of exon
D: Donor site
A: Acceptor site
Probe placement
using Regular Expressions search in annotation
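Because the annotation string has one character per base, a regular expression match in the annotation gives probe coordinates in the sequence directly. A minimal sketch using the example annotation from the previous slide; the patterns and the `find_sites` helper are illustrative assumptions, not OligoWiz's own rule syntax.

```python
import re


def find_sites(annotation, pattern):
    """All (start, end) spans where pattern matches, including overlapping ones."""
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1)) for m in lookahead.finditer(annotation)]


sequence = "ATGTCTACATATGAAGGTATGTAA"
annotation = "(EEEEEEEEEEEEEE)DIIIIIII"

# Probes of 10 bases lying entirely inside an exon.
exon_sites = find_sites(annotation, r"E{10}")

# Probes spanning the exon end and donor site (e.g. to detect unspliced transcripts).
junction_sites = find_sites(annotation, r"E{4}\)DI{4}")
junction_probes = [sequence[start:end] for start, end in junction_sites]
```

The lookahead wrapper is used so that overlapping candidate positions are all reported, which a plain `finditer` on the pattern would skip.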
Extracting annotation
from GenBank files
-FeatureExtract server
-www.cbs.dtu.dk/services/FeatureExtract
Exercise
•Running OligoWiz 2.0
•Java 1.4.1 or better is required
•Input data
•Sequence only (FASTA)
•Sequence and annotation
•Rule-based placement of multiple probes
•Distance criteria
•Annotation criteria
•Please go to the exercise web-page
linked from the course program