A Study of GeneWise with the Drosophila Adh Region
Asta Gindulyte
CMSC 838 Presentation
Authors: Yi Mo, Moira Regelson, and Mike Sievers
Paracel Inc., Pasadena, CA
CMSC 838T – Presentation

Motivation
- Genome annotation
  - Extraction of biologically relevant knowledge from raw genomic sequence data
- Need faster genome annotation methods
  - DNA sequences are very long (millions of nucleotides)
  - Current methods are computationally too expensive
- Approach/Solution
  - GeneMatcher2 hardware acceleration of GeneWise

Outline
- Motivation
  - Genome annotation
- GeneMatcher2
  - Design
  - ASIC hardware
- Comparison
  - GeneWise algorithm
  - HalfWise algorithm
  - Performance (time, precision)
- Observations
  - Performance improvement
  - Cost effectiveness

Approach
- Problem: make GeneWise run faster
  - "Embarrassingly parallel" algorithm
  - Computationally too expensive when run in parallel on PCs
- Paracel's solution: hardware acceleration
  - Don't change the algorithm
  - Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible
  - 6LITE algorithm, now also in Wise2

GeneMatcher Architecture

ASIC Hardware
- ASIC: application-specific integrated circuit
- Designed to speed up dynamic programming algorithms (could be used for Smith-Waterman)
- Each ASIC board has 3072 processors
- System has up to 9 boards
- Cost per board is around $40K

GeneWise Algorithm
- Performs a search of genomic DNA sequence data using a protein HMM
  - Build HMMs from protein families
  - Scan the genome using the HMM
    - Look for a start codon
    - A "GT" sequence signals a possible 5' splice site
    - An "AG" sequence signals a possible 3' splice site
- Dynamic programming is used in the scanning process
  - Obtain the probability of the most likely path in the HMM generating the sequence
  - Obtain the alignment by backtracking

GeneWise model on GeneMatcher2

HalfWise Algorithm
- Reduce cost by running BLAST to select HMMs with possible hits
- Use these HMMs with the GeneWise database search and sequence alignment algorithm
- May miss some genes due to BLAST misses

Evaluation
- Test data set
  - A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region
- Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence

Evaluation: Speed

Evaluation: Score

Evaluation: Sensitivity and Specificity

Observations
- Performance improvement
  - The speedup is several orders of magnitude
  - Makes real target applications possible
  - Accuracy might be improved over the HalfWise algorithm
- Cost effectiveness
  - The system used costs around $500K
  - $500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower
- Weaknesses
  - Cannot modify the algorithm
  - Not enough data to assess scalability
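
The cost-effectiveness figures on the Observations and ASIC Hardware slides can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted on the slides. The function name `summarize` and the assumption of a fully populated 9-board system are illustrative; the slides do not say how many boards the $500K system contained.

```python
# Figures quoted on the slides.
PROCESSORS_PER_BOARD = 3072
MAX_BOARDS = 9
COST_PER_BOARD_USD = 40_000
SYSTEM_COST_USD = 500_000      # "System used costs around $500K"
PC_COUNT = 500
PC_COST_USD = 1_000
PC_CLUSTER_SLOWDOWN = 10       # "would run about 10 times slower"


def summarize():
    """Return derived totals; assumes all 9 board slots are filled (hypothetical)."""
    asic_processors = PROCESSORS_PER_BOARD * MAX_BOARDS
    board_cost_usd = COST_PER_BOARD_USD * MAX_BOARDS
    pc_cluster_cost_usd = PC_COUNT * PC_COST_USD
    # Throughput per dollar of the accelerator relative to the PC cluster:
    # (speed ratio) * (PC cluster cost / accelerator system cost).
    throughput_per_dollar_ratio = (
        PC_CLUSTER_SLOWDOWN * pc_cluster_cost_usd / SYSTEM_COST_USD
    )
    return {
        "asic_processors": asic_processors,
        "board_cost_usd": board_cost_usd,
        "pc_cluster_cost_usd": pc_cluster_cost_usd,
        "throughput_per_dollar_ratio": throughput_per_dollar_ratio,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Because the two systems cost about the same, the price/performance advantage equals the quoted speed ratio. Under the 9-board assumption, the boards alone would account for most of the system price.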
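
The dynamic programming step on the GeneWise Algorithm slide (probability of the most likely path, then alignment by backtracking) follows the general Viterbi pattern. The sketch below is a generic log-space Viterbi over a toy two-state model. It is not the GeneWise model: the states, the probabilities, and the names `viterbi` and `path_log_prob` are all made up for illustration.

```python
import math

# Hypothetical toy model: two states emitting nucleotides (not the GeneWise HMM).
STATES = ["exon", "intron"]
START_P = {"exon": 0.7, "intron": 0.3}
TRANS_P = {
    "exon": {"exon": 0.8, "intron": 0.2},
    "intron": {"exon": 0.3, "intron": 0.7},
}
EMIT_P = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most likely path for obs."""
    # score[t][s]: best log probability of any path that ends in state s at position t.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + math.log(trans_p[p][s]))
            score[t][s] = (
                score[t - 1][prev]
                + math.log(trans_p[prev][s])
                + math.log(emit_p[s][obs[t]])
            )
            back[t][s] = prev
    # Backtrack from the best final state to recover the path.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path


def path_log_prob(obs, path, start_p, trans_p, emit_p):
    """Log probability of emitting obs along one specific state path."""
    total = math.log(start_p[path[0]]) + math.log(emit_p[path[0]][obs[0]])
    for t in range(1, len(obs)):
        total += math.log(trans_p[path[t - 1]][path[t]])
        total += math.log(emit_p[path[t]][obs[t]])
    return total


if __name__ == "__main__":
    log_p, states_path = viterbi("ACGTAG", STATES, START_P, TRANS_P, EMIT_P)
    print(log_p, states_path)
```

Each cell of the score table depends only on the previous column, and this regular recurrence is the kind of computation the ASIC Hardware slide says the processors are designed to accelerate.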