Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for

advertisement
Accurate Method for Fast Design of
Diagnostic Oligonucleotide Probe Sets for
DNA Microarrays
Nazif Cihan Tas
ctas@cs
CMSC 838 Presentation
Motivation

DNA microarrays techniques are used intensely for
identification of biological agents

Gene Expression Studies

Diagnostic Purposes

Identification of Microorganisms in samples
Item Extraction


Complex Problem

Find the necessary probes and the temperature

Probe sets should be reliably detect and differentiate target
sequences

Large Databases

NEW!! Homologous Genes (how to find specific probes)
CMSC 838T – Presentation
Talk Overview

Overview of talk

Motivation

Problem Statement

Algorithm

Mathematical Aspects

Experimentation

Discussion
CMSC 838T – Presentation
Problem Statement

Positive Probes

Database set S0

Target S1

For each sequence in S1, find at least one probe

For S0 - S1 try to avoid it (but do not care if happens)
S1
S0

High Specificity: # of non-target matches are minimized

High Sensitivity: # of covered target seq. is maximized
CMSC 838T – Presentation
Problem Statement

Negative Probes

Determine as few as possible probes which together hybridizes
with all sequences in S0 - S1 but with NONE in S1.
S1
S0

High Specificity: No seq. in S1 may hybridize

High Sensitivity: Max # of seq. in S0 - S1 be covered
CMSC 838T – Presentation
Probe Design Constraints

Sequence Related





Length of probes
Deviation of melting temperature of probe-target hybrids must
be low (for physical reasons)
No self complementary regions longer than four nucleotides
(not descriptive enough)
Melting temperatures of target and non-target seq. must be
larger than a predefined (too close, too hard to identify)
 Ensuring a minimum number of mismatches is enough
(homologous sequences)
System Related


Execution Time
Usability
CMSC 838T – Presentation
Algorithm

Overview

Probe Generation

Hybridization Prediction

Probe Selection
CMSC 838T – Presentation
Algorithm
Probe Generation


Subproblem:

Generate probe candidates for the sequences

Keep the set as small as possible without losing any optimal
candidate (exclude infeasible ones)
Suffix Tree

Why?

Allows fast recognition of repetitive subsequences
 Identifies non-unique probes (i.e. with more than one target)
 Efficient for memory and for T computation (reduce time)
How?



Tree is constructed from the sequences
Traversed (Watson-Crick complement)
CMSC 838T – Presentation
Suffix Tree

Input: TACTACA

TACTACA

ACTACA

CTACA

TACA

ACA

CA

A

$ denotes end of string

Constructed in linear time
CMSC 838T – Presentation
Probe Generation

Further Improvements

Filters applied for cut off



Probe length (predefined)
G-C content (for temperature)
Self-complementarity

Probes should not contain complements as subsequences

Finally, remove highly conserved (non-specific) regions

Insert into hashtables according to their lengths
CMSC 838T – Presentation
Algorithm
CMSC 838T – Presentation
Algorithm
Hybridization Prediction

Subproblem:



Design





Search for the right probe
Search is expensive, Intelligent Hashing used
A frame is moved over target and nontarget seqs. with several
lengths
 Previous algorithm (Kaderali 2002): Use the suffix tree
At each step, hash values are calculated. If hit, predict melting
temperature, store in hybridization matrix.
If there are too many hits for a probe, then it is not unique, remove it
Why intelligent?
 Hash time is linear
 Allows inexact matching because of hashing (No analysis)
Parallelization


Several threads are searching for probe targets.
 Tree and hashtables are fixed.
One thread writes to the final matrix
CMSC 838T – Presentation
Hybridization Prediction

Empirical Simulation:




One million random probe-target pairings generated
Four mismatches or one insertion or deletion plus one strong
central mismatch chosen
T<20 C for 93%
Complexity is O ( |S0| |S1| )
 Possible probe candidates is |S1|
(linear)
 Each position of database S0 must be checked
CMSC 838T – Presentation
Algorithm
CMSC 838T – Presentation
Algorithm
Probe Selection

Subproblem:


Algorithm Analysis:





For each probe candidate
 g: #of matches in S1
 b: #of matches in S0 - S1
 t: highest melting point in S1
Probes for which g or b values is too large, are removed
Sort according to g,b and t.
Apply Depth First Search
Advantages



Use the hybridization matrix to finalize the probe selection
 We have positive probes and negative probes to proceed
Performs well (No comparison though)
Guarantees to choose all specific probes if any were found.
Disadvantages

can NOT guarantee optimal selection in terms of coverage
CMSC 838T – Presentation
Negative Probe Selection

Let S2 =S0 - S1 and B subset of S2 . The probes that
detect S1 also detects some of B elements.

Algorithm for Negative Probes


Apply probe generating and preselection for B.

Conduct hybridization on B U S1 .

Remove the probes which hybridizes with S1 .

Sort the remaining probes according to their hit number.

Successively select the probes which covers most target seq.
Guarantees optimal solution for coverage and probe
number usage
CMSC 838T – Presentation
Algorithm
Probe Selection
CMSC 838T – Presentation
Mathematical Aspects
CMSC 838T – Presentation
Experimentation


Parallelized on SMP platform

Classic worker-producer

Intel Dual Pentium III 933 MHz, 1 GB memory
Test data

ssu rRNA of ARB project

20.282 ssu rRNA sequences

1.401 < lengths < 4.179

%97 of them are shorter than 2.000 bases
CMSC 838T – Presentation
Discussion

High Performance


Execution is linear with size of database, decreases if longer
probes are used
Low Memory Consumption

Depends on the size of the sequence selection, NOT database
size

Automatic Design of Group Probes and negative
probes

High Quality Probe Design
CMSC 838T – Presentation
Discussion

Comparison with previous work

vs. ARB

Not suited for large scale probe design
vs. LCF

Does not consider highly conserved data
 Memory consumption is high
 Works well with short probes only
vs. others




Mostly can not deal with insertion and deletions
Execution is slow
CMSC 838T – Presentation
Download