A Study of GeneWise with the
Drosophila Adh Region
Asta Gindulyte
CMSC 838 Presentation
Authors: Yi Mo, Moira Regelson, and Mike Sievers
Paracel Inc., Pasadena, CA
Motivation

Genome annotation



Extraction of biologically relevant knowledge from raw
genomic sequence data
Need faster genome annotation methods

DNA sequences are very long (millions of nucleotides)

Current methods are computationally too expensive
Approach/Solution

GeneMatcher2 hardware acceleration of GeneWise
CMSC 838T – Presentation
Outline

Motivation




Genome annotation
GeneMatcher2

Design

ASIC hardware
Comparison

GeneWise algorithm

HalfWise algorithm

Performance (time, precision)
Observations

Performance improvement

Cost effectiveness
Approach


Problem: make GeneWise run faster

“Embarrassingly parallel” algorithm

Computationally too expensive when run in parallel on PCs
Paracel’s solution: hardware acceleration

Don’t change the algorithm

Produce an implementation on the GeneMatcher2
supercomputer that works as much like the original software as
possible

6LITE algorithm, now also in Wise2
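
To illustrate what "embarrassingly parallel" means here: each model-versus-genome comparison is independent of the others, so the jobs can be split across workers with no communication between them. The sketch below is a toy illustration only. `score_model` is a hypothetical stand-in that counts exact motif matches, not the GeneWise comparison, and the model names and sequences are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def score_model(model, dna):
    """Hypothetical stand-in for one model-vs-genome comparison.

    Counts exact occurrences of a motif; the real comparison is a
    dynamic programming alignment against a profile HMM.
    """
    name, motif = model
    width = len(motif)
    hits = sum(1 for i in range(len(dna) - width + 1) if dna[i:i + width] == motif)
    return name, hits


def search_all(models, dna, workers=4):
    # Every model is scored independently of the others, so the work
    # can be divided among any number of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda m: score_model(m, dna), models))


# Made-up models and sequence, for illustration only.
models = [("famA", "ATG"), ("famB", "GTAAG"), ("famC", "CCC")]
dna = "ATGGTAAGATGCC"
results = search_all(models, dna)
print(results)
```

Because the jobs share nothing, adding workers helps until the total compute cost itself becomes the limit, which is the situation the slide describes for PC clusters.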
GeneMatcher Architecture
ASIC Hardware

ASIC – application-specific integrated circuit

Designed to speed up dynamic programming algorithms

(could be used for Smith-Waterman)

Each ASIC board has 3072 processors

System has up to 9 boards

Cost per board around $40K
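
One common reason many small processors help with dynamic programming is that all cells on the same anti-diagonal of the matrix depend only on earlier anti-diagonals, so they can be computed at the same time. The slides do not say how GeneMatcher2 schedules its processors, so the sketch below is only an assumption-laden illustration using Smith-Waterman: it fills the matrix in anti-diagonal ("wavefront") order and checks that the result matches the usual row-by-row order. The scoring values are arbitrary.

```python
def cell(H, a, b, i, j, match, mismatch, gap):
    """Smith-Waterman recurrence for one cell (linear gap penalty)."""
    s = match if a[i - 1] == b[j - 1] else mismatch
    return max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)


def sw_row_major(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix row by row."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = cell(H, a, b, i, j, match, mismatch, gap)
    return max(max(row) for row in H)


def sw_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Same score, filling the matrix one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):  # all cells with i + j == d
        # Cells in this inner loop are mutually independent: each one
        # reads only anti-diagonals d-1 and d-2, so parallel hardware
        # could compute them simultaneously.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            H[i][j] = cell(H, a, b, i, j, match, mismatch, gap)
    return max(max(row) for row in H)


print(sw_row_major("GATTACA", "GCATGCT"), sw_wavefront("GATTACA", "GCATGCT"))
```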
GeneWise Algorithm

Perform a search of genomic DNA sequence data using
a protein HMM

Build HMMs from protein families

Scan genome using HMM




Look for start codon
“GT” sequence signals possible 5’ splice site
“AG” sequence signals possible 3’ splice site
Dynamic programming used in the scanning process


Obtain probability of the most likely path in HMM generating
the sequence
Obtain alignment by backtracking
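
The two dynamic programming steps above (most likely path, then backtracking) can be sketched with the Viterbi algorithm on a toy HMM. This is not the GeneWise model, which is considerably larger and handles codons, frameshifts and introns. The two states, their names and all probabilities below are made-up assumptions chosen only to make the example small.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return (log-probability, state path) of the most likely path for obs."""
    V = [{s: math.log(start[s] * emit[s][obs[0]]) for s in states}]
    back = []  # back[t-1][s] = best predecessor of state s at position t
    for t in range(1, len(obs)):
        scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            scores[s] = V[-1][best_prev] + math.log(trans[best_prev][s] * emit[s][obs[t]])
            pointers[s] = best_prev
        V.append(scores)
        back.append(pointers)
    # Backtrack from the best final state to recover the path.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return V[-1][last], path


def path_logprob(obs, path, start, trans, emit):
    """Log-probability of obs along one fixed state path (for checking)."""
    lp = math.log(start[path[0]] * emit[path[0]][obs[0]])
    for t in range(1, len(obs)):
        lp += math.log(trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]])
    return lp


# Toy parameters (assumptions, not taken from GeneWise or Pfam).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

score, path = viterbi("GCGCATAT", STATES, START, TRANS, EMIT)
print(score, path)
```

The returned score is the log-probability of the single best path, and the path itself plays the role of the alignment recovered by backtracking.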
GeneWise model on GeneMatcher2
HalfWise Algorithm

Reduce cost by running BLAST to select HMMs with
possible hits

Use these HMMs with GeneWise database search and
sequence alignment algorithm

May miss some genes due to BLAST misses
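
The filter-then-search idea, and why it can miss hits, can be sketched as follows. Everything here is a toy assumption: `cheap_prefilter` stands in for the BLAST step by requiring one shared exact k-mer, `expensive_search` stands in for the full GeneWise search by allowing one mismatch in an ungapped window, and the model names and sequences are invented.

```python
def kmers(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cheap_prefilter(models, dna, k=4):
    """Stand-in for the BLAST step: keep models sharing an exact k-mer with dna."""
    dna_kmers = kmers(dna, k)
    return [name for name, consensus in models.items() if kmers(consensus, k) & dna_kmers]


def expensive_search(consensus, dna, max_mismatch=1):
    """Stand-in for the full search: any ungapped window within max_mismatch."""
    n = len(consensus)
    for i in range(len(dna) - n + 1):
        mismatches = sum(1 for x, y in zip(consensus, dna[i:i + n]) if x != y)
        if mismatches <= max_mismatch:
            return True
    return False


# Invented data. "famB" differs from the DNA by one base in the middle,
# so it shares no exact 4-mer with the DNA and the prefilter drops it.
dna = "TTACGTACGGATTGCAGG"
models = {"famA": "ACGTACGG", "famB": "GATAGCA", "famC": "CCCCCCC"}

full_hits = sorted(n for n, c in models.items() if expensive_search(c, dna))
filtered = cheap_prefilter(models, dna)
halfwise_hits = sorted(n for n in filtered if expensive_search(models[n], dna))

print("searched after filter:", filtered)
print("full search hits:     ", full_hits)
print("filtered search hits: ", halfwise_hits)
```

The filtered run does less expensive work but loses any hit that the cheap filter fails to flag, which is the trade-off the slide notes.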
Evaluation

Test data set

A genomic DNA sequence contig of about 2.9 Mb from the
Drosophila Adh region

Focus on finding all Pfam (Protein families database of
alignments and HMMs) protein profile-HMMs that occur in the
Adh genomic sequence
Evaluation: Speed
Evaluation: Score
Evaluation: Sensitivity and Specificity
Observations

Performance improvement

The speedup is several orders of magnitude.

Makes real target applications possible
Accuracy might be improved over the HalfWise algorithm



Cost effectiveness

System used costs around $500K

$500K worth of Linux PCs (500 processors at $1K each) would run
about 10 times slower
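
The cost comparison above can be worked through as simple arithmetic. The figures are the ones quoted on the slide. The last step assumes that PC cluster throughput scales linearly with the number of machines, which is an assumption of this sketch and is not claimed in the presentation.

```python
system_cost = 500_000   # accelerated system, in dollars (from the slide)
pc_cost = 1_000         # one Linux PC processor, in dollars (from the slide)
slowdown = 10           # equal-cost PC cluster runs about 10x slower (from the slide)

# Number of PC processors the same budget buys.
n_pcs = system_cost // pc_cost

# Assumption: throughput scales linearly with PC count, so matching the
# accelerated system would take about `slowdown` times as many PCs.
pcs_to_match = n_pcs * slowdown
matching_cluster_cost = pcs_to_match * pc_cost

print(f"PCs for the same budget:      {n_pcs}")
print(f"PCs to match throughput:      {pcs_to_match}")
print(f"Cost of a matching cluster:  ${matching_cluster_cost:,}")
```

Under that linear-scaling assumption, matching the accelerated system would take roughly ten times the budget in PCs.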
Weaknesses

Cannot modify the algorithm

Not enough data to assess scalability