Software Infrastructure for High-Performance Metagenomics Annotation
“Working with thousands of jigsaw puzzles”
by Brandon Sutherlin

What is Metagenomics?
• Metagenomics (also called Environmental Genomics or Community Genomics) is the study of genomes recovered from environmental samples, without the need to culture the organisms
• Metagenomics processes data using bioinformatics tools

What is Bioinformatics?
There is no universally accepted definition of bioinformatics. Two definitions:
• Classical Bioinformatics: Any use of computers to store, compare, retrieve, analyze, simulate or predict the composition or the structure of biologically based molecules (DNA, RNA, protein, etc.)
• New Bioinformatics: Bioinformatics derives knowledge from computer analysis of biological data. These data can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Other closely related fields are incorporated in the new bioinformatics (eco-informatics, medical informatics, quantitative biology, etc.)

Why is Metagenomics Important?
• All reasons lead to more knowledge.
• Organisms can be studied directly in their environments, bypassing the need to isolate each species
• There are significant advantages for viral metagenomics, because of the difficulty of cultivating the appropriate host
• Genomic information has advanced research in a diverse array of fields, including forensic science and biomedical research

Whole Genome Shotgun Sequencing for Metagenomics
[Diagram: two parallel workflows]
• One genome: random genome fragmentation, then genome assembly using overlaps
• Multiple genomes: random fragmentation of the genomes, then assembly of the genomes using overlaps

Many projects, many fragments
• Many different projects are now completed or in progress
Examples:
Prokaryote:
• Sargasso Sea (Venter et al. 2004): 1.6 billion base pairs generated, estimated to come from 1800 genomic species
Viral:
• Marine water (Breitbart et al. 2002): Mission Bay and Scripps Pier.
873 sequences for Mission Bay and 1061 for Scripps Pier, of which more than 65% and 73%, respectively, were unknown

Many projects, many fragments
• Three years after the Marine Water project, most of the sequences are still unique, even though GenBank has more than doubled in size.
• All of the metagenome projects have generated enormous amounts of data that still cannot be assembled or annotated.

Bioinformatics
What bioinformatics tools are available now? The focus is on viral metagenomics.
• Assembly
• Annotation
• Diversity and structure prediction

Focus on Annotation
Tools and methods available
• BLAST (Basic Local Alignment Search Tool): the most popular tool for annotating the fragments
Problems
• A large portion of the sequences have no hits in the databases
• An initial BLAST may offer clues, not annotation

Our Specific Application
Progressive Comparisons
• Based on criteria (E-value) selected by the user, the input sequence can be compared against other databases and/or “BLASTed” with translation

Our Specific Application
[Diagram: process tree for the input sequence AACTAACTCCGCTAAACG. It starts at DB1 and branches on “If e-value is > X” or “If e-value is < X” toward further databases (labels: DB, DB3, DB4, DB6, DB7, DB8).]

Our Specific Application (cont’d)
[Diagram: the same process tree as on the previous slide.]
• The process tree can be as deep as the user desires
• The nodes do not communicate directly with each other. Later versions may have more intelligent clients

Executing the Application
Challenges
• Many sequences, many large databases
  • And they are getting larger
• Therefore we need to run this in parallel on multiple hosts (each with its local storage)
  • A bunch of workstations in a lab
  • Cluster
  • A bunch of PCs over the internet

Master-Worker Approach
[Diagram labels: Client, Map, Master, and four worker nodes with their local databases]
• Node 1: DB 1, 3, 4
• Node 2: DB 2, 3, 5
• Node 3: DB 6
• Node 4: DB 1, 2, 3, 4, 5, 6
The current version has bidirectional communication between an omniscient master and its faithful sheep (clients/nodes).
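The progressive comparisons and the master's database map described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the names `Step`, `NODE_DBS`, `pick_node` and `run_tree` are hypothetical, the threshold default is arbitrary, the "least-loaded worker" rule is a stand-in for whatever distribution policy the real master uses, and the BLAST search itself is passed in as a function so that nothing here pretends to be the real BLAST interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Step:
    """One node of the process tree: search `db`, then branch on the e-value."""
    db: str
    threshold: float = 1e-5             # the user-chosen "X" (value is an assumption)
    if_above: Optional["Step"] = None   # followed when e-value > threshold
    if_below: Optional["Step"] = None   # followed when e-value <= threshold


# The master's map of which databases each worker holds locally
# (node contents copied from the Master-Worker slide).
NODE_DBS: Dict[str, Set[str]] = {
    "node1": {"DB1", "DB3", "DB4"},
    "node2": {"DB2", "DB3", "DB5"},
    "node3": {"DB6"},
    "node4": {"DB1", "DB2", "DB3", "DB4", "DB5", "DB6"},
}


def pick_node(db: str, load: Dict[str, int]) -> str:
    """Hypothetical policy: the least-loaded worker that holds `db`."""
    candidates = [name for name, dbs in NODE_DBS.items() if db in dbs]
    if not candidates:
        raise LookupError(f"no worker holds {db}")
    # Ties are broken by node name so the choice is deterministic.
    return min(candidates, key=lambda name: (load[name], name))


def run_tree(
    sequence: str,
    root: Step,
    search: Callable[[str, str, str], float],
) -> List[Tuple[str, str, float]]:
    """Walk the process tree for one sequence.

    `search(node, db, sequence)` stands in for sending the job to a worker
    and getting back the best e-value. Returns (db, node, e-value) per step.
    """
    load = {name: 0 for name in NODE_DBS}
    trace: List[Tuple[str, str, float]] = []
    step: Optional[Step] = root
    while step is not None:
        node = pick_node(step.db, load)
        load[node] += 1
        e_value = search(node, step.db, sequence)
        trace.append((step.db, node, e_value))
        step = step.if_above if e_value > step.threshold else step.if_below
    return trace


if __name__ == "__main__":
    # Two-level tree: a poor hit in DB1 sends the sequence on to DB4.
    tree = Step("DB1", if_above=Step("DB4"), if_below=Step("DB3"))

    def fake_search(node: str, db: str, seq: str) -> float:
        # Made-up e-values in place of a real BLAST run.
        return 10.0 if db == "DB1" else 1e-30

    for db, node, e_value in run_tree("AACTAACTCCGCTAAACG", tree, fake_search):
        print(db, node, e_value)
```

Keeping the search as a plain callable means the same tree walk works whether the job runs on localhost or is shipped to a worker over a socket.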
A French graduate student in Dr. Casanova’s lab is developing algorithms for efficient (optimal?) database distribution among the workers.

Example Process Tree
[Diagram: example process tree]

The Nitty Gritty
• The project was developed in Python
  • Why?
• The configuration files are XML
  • Standardization
  • Parsing support

How Well does it Work?
• It is “Working as designed” on localhost
• Ready to be tested on a small cluster
• Is the need met?
  • Dr. Poisson’s real-life example, The 600
• What about MHPCC resources?

My Supervisors for This Project
• Allow me to introduce Guylaine Poisson, Ph.D., and Henri Casanova, Ph.D.
• Both outstanding mentors, U.H. faculty, and foreign nationals

Future Application Development
• ICS 675
• GUI
• Checkpoint
• Research
• Release to the Community

What Did I Learn?
• Python
• Sockets and Network Programming
• BLAST
• Mac OS X
• Time Management

Gratitude in No Particular Order
• Dr. Poisson and Dr. Casanova
• Dr. Brown
• MHPCC H.R. People…
• All of You…

Questions or Comments?