Developing Accessible Application Software for Individual de novo Genome Projects

Vince Forgetta, PhD Candidate
Ken Dewar, PhD, Supervisor
Department of Human Genetics, McGill University
Montreal, Quebec, Canada
December 8th, 2011

Next-Gen Gap

“Unfortunately, the software and computer hardware demands on these analyses are not much less than those of the large Genome Centers. From this perspective, the gap between large-scale genome centers and individual investigators may seem to be growing, not shrinking, as the next-generation platforms’ apparent promise of a ‘Genome Center in a box’ may have only been half delivered, providing data without a full suite of tools.” (Nature Methods 6, S2 - S5 (2009))

Bacterial genome in < 1 week for ~ $3000

(Genome Assembly)+
- Download Data
- Learn *NIX
- Install Software and Dependencies
- Run Software … Wait? … Problems?

Three Common Methodologies in de novo Genome Analysis

1. Display and analysis of genome annotations
2. Quality assessment of a genome assembly
3. Comparison and mining of genomic data from public repositories

One or more methodologies used to address needs in three specific projects; projects used as a vehicle to develop software:

Project                              Software         Methodology
C. difficile 14 Genome Comparison    cgb              1. Genome Display
Multi-centre WGS of O. novo-ulmi     ContiGo          2. Assembly QA
E. fergusonii ECD-227                BLAST in Pivot   3. Data Mining

Assembly Quality Assessment

[Figure labels: Assembly Analysis, DNA Sequencing Centre, Researcher, Assembly]

• Researchers should have easy access to determine quality and perform simple analysis.
• Delays and limits on data access exist:
  - Viewers need to be installed and have specific software (e.g. Linux) or hardware requirements (e.g. RAM).
  - Assembly data (multiple GBs) must be downloaded.

Objective

• Develop a simple assembly viewer that operates within a web browser, allowing a researcher to rapidly analyze and access their data.
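To make the quality-assessment objective above concrete, the following is a minimal sketch of the kind of summary a browser-based viewer could be fed: it reads contigs in FASTA format, computes a few common assembly statistics (contig count, total length, N50), and serialises them as JSON. It is not ContiGo's code. The function names, the choice of statistics, and the JSON layout are all assumptions made for illustration.

```python
import json


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_summary(lines):
    """Build a JSON-serialisable summary (hypothetical layout) of an assembly."""
    contigs = [{"name": n, "length": len(s)} for n, s in read_fasta(lines)]
    lengths = [c["length"] for c in contigs]
    return {
        "contig_count": len(contigs),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "contigs": contigs,
    }


# Toy input standing in for a real assembly file.
example = [">contig1", "ACGTACGTAC", ">contig2 extra", "ACGT", "AC", ">contig3", "ACG"]
summary = assembly_summary(example)
print(json.dumps(summary, indent=2))
```

For a real assembly the list of lines would be replaced by an open file handle. Writing the result to a static JSON file on the server is one way to let the browser fetch it without downloading the full assembly.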
Method

Parser/Converter: Python is used to parse, analyze, and convert assembly data into web-accessible formats (HTML, JSON, JPG images), which are stored on sequencing centre servers.

Interface: A browser-based interface (HTML) dynamically accesses the data on the servers (JavaScript). It incorporates pre-existing web technologies (jQuery, Seadragon Deepzoom AJAX).

Usage:
- After genome assembly, the parser/converter is run on sequencing centre servers.
- The researcher accesses the interface over the internet using a modern web browser.

Performance

Parser/Converter:
– Multiple platforms (Windows/OS X/Linux).
– Multi-processor support.
– Low memory usage (< 250 MB of memory per processor).

User interface:
– Client-side programming decreases server load.
– Data is downloaded on demand, which suits users with limited bandwidth.
– Sole system requirement is a modern web browser (Firefox, Opera, Google Chrome), which eases installation.
– Low memory usage (peaks at ~ 250 MB).

The Interface

Assembly statistics; batch download of sequence and statistical data.

Table of contig/scaffold statistics:
• Sort/filter by column
• Access to contig sequence/quality and read sequences

Contig Assembly:
- Pan/Zoom
- Identify position, read names, mismatches

Dynamic Charts:
• Toggle axis value
• Identify points
• Summarize regions

Demo

3. Data Mining

blip.codeplex.com

Microsoft Research Summer Internship
Microsoft Biology Foundation
Redmond, Washington, USA
Mentor - Simon Mercer

BLAST

[Figure: example DNA sequences, BLAST (NCBI); Species, Function, … ?]
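The question posed on the BLAST slide (which species, which function?) is usually answered from the description line of each hit. The following is a minimal sketch, assuming NCBI-style description lines of the form `>identifier description [organism]`, like the hit shown later in the deck. The regular expression, the function name `parse_defline`, and the returned field names are illustrative assumptions, not part of BL!P or of any BLAST library.

```python
import re

# Assumed layout: ">" + identifier, free-text description, organism in square brackets.
DEFLINE = re.compile(
    r"^>?(?P<ids>\S+)\s+(?P<function>.*?)\s*\[(?P<species>[^\[\]]+)\]\s*$"
)


def parse_defline(defline):
    """Split a hit description line into identifier, function and species.

    Returns None when the line does not follow the assumed layout
    (for example, when no bracketed organism name is present).
    """
    match = DEFLINE.match(defline.strip())
    if match is None:
        return None
    return {
        "identifier": match.group("ids").strip("|"),
        "function": match.group("function"),
        "species": match.group("species"),
    }


hit = parse_defline(
    ">gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family "
    "[Escherichia coli MS 78-1]"
)
print(hit)
```

Keeping the parsed fields in a plain dictionary makes it straightforward to tally hits by species or function across the several thousand genes of a genome, which is the kind of overview that raw text output does not provide.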
[Figure label: Local]

Limitation

Scientist

>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321

 Score = 583.563 bits (1503), Expect = 8.65371E-165
 Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
 Frame = 0

Query  1    MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  60
            MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC
Sbjct  41   MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  100

Query  61   PAKKVNRKLAGSALLQYPDVVKSILTEVVNAVDVPVTLKIRTGWAPEHRNCEEIAQLAED  120
            PAKKVNRKLAGSALLQYPDVVKSILTEVVN VDVPVTLKIRTGWAPEHRNCEEIAQLAED
Sbjct  101  PAKKVNRKLAGSALLQYPDVVKSILTEVVNTVDVPVTLKIRTGWAPEHRNCEEIAQLAED  160

Query  121  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  180
            CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA
Sbjct  161  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  220

Query  181  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  240
            LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR
Sbjct  221  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  280

Query  241  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  281
            KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA
Sbjct  281  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  321

~5000 genes
E. coli

Programmer

Blast in Pivot

[Figure: repeated sets of example DNA sequences, labelled ???, 1, 2, 3]

E. coli ECD-227

Acknowledgement: Moussa Diarra, Heidi Rempel

Demo

Conclusions

ContiGo: used by clients of the Genome Centre at McGill (release soon).
BL!P: >500 downloads (blip.codeplex.com).

Acknowledgements

C. difficile / Ophiostoma novo-ulmi
Ken Dewar, Jan Kieleczawa, Andre Dascal, Michael Zianni, Matthew Oughton, Robert Steen, Joana Dias, Deborah Grove, Gary Leveque, Anoja Perera, Pascale Marquis, Robert Lyons Jr., Corina Nagy, Sushmita Singh, Amelie Villeneuve, Doug Bintzler, Ivan Brukner, Mark Miller, Scottie Adams, Vivian Loo, Deborah Grove, Mike Mulvey, Gregory Grove, Dale Gerding, Robert Lyons Jr., Maya Rupnik, Suzanne Genik, Elaine Mardis, Chris Wright, V. Magrini, Alvaro Hernandez, M. Hickenbotham, Sharon Bachman, K. Haub, Lorie Hetrick, C. Markovic, Sushmita Singh, J. Nelson, Nichole Peterson, Gary Leveque, Joana Dias, Clotilde Teiling, Tim Harkins

E. coli ECD-227
H. Rempel, Andrew Metcalfe, M. S. Diarra

BL!P/Microsoft
Simon Mercer, Xin-Yi Chua, Mauro Luigi Drago, Beatriz Diaz Acosta, Vivek Kumar, Bob Davidson, Mike Zyskowski, Xiaoji Chen, Bob Silverstein, Vikram Bapat, Jared Jackson, Wei Lu, The Pivot Team