Software Infrastructure for High-Performance Metagenomics Annotation
“Working with thousands of jigsaw puzzles”

by
Brandon Sutherlin
What is Metagenomics?

Metagenomics (also called environmental genomics or
community genomics) is the study of genomes
recovered directly from environmental samples, without the
need to culture the organisms first

Metagenomics data is processed with bioinformatics
tools
What is Bioinformatics?


There is no universally accepted definition of bioinformatics.
Two definitions:

• Classical bioinformatics: any use of computers to store, compare,
retrieve, analyze, simulate, or predict the composition or the
structure of biologically based molecules (DNA, RNA, protein, etc.)

• New bioinformatics: deriving knowledge from
computer analysis of biological data. The data can consist of the
information stored in the genetic code, but also experimental
results from various sources, patient statistics, and scientific
literature.

Other fields are closely related to and incorporated in the new
bioinformatics (eco-informatics, medical informatics, quantitative
biology, etc.)
Why is Metagenomics
Important?

All reasons lead to more knowledge.

Organisms can be studied directly in their environments
bypassing the need to isolate each species

There are significant advantages for viral metagenomics,
because of difficulties cultivating the appropriate host

Genomic information has advanced research in a diverse array
of fields, including forensic science and biomedical research
Whole Genome Shotgun Sequencing
for Metagenomics

[Figure: two parallel workflows]
One genome: random genome fragmentation, then genome assembly using overlaps
Multiple genomes: random fragmentation of the genomes, then assembly of the genomes using overlaps
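
As a toy illustration of "assembly using overlaps", the sketch below repeatedly merges the two fragments that share the longest suffix/prefix overlap. It is a deliberate simplification (no sequencing errors, reverse complements, or repeats) and not how a production assembler works; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of fragments until no overlaps remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping fragments of one short sequence
reads = ["CGCTAAACG", "AACTAACTCC", "ACTCCGCTA"]
print(greedy_assemble(reads))
```

With several genomes in the sample, the same procedure would return several contigs, and fragments from different organisms that happen to overlap could be merged wrongly, which is part of why metagenomic assembly is hard.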
Many projects, many fragments

Many different projects are now completed or in
progress

Examples:

Prokaryote:
• Sargasso Sea (Venter et al. 2004): 1.6 billion base pairs
generated, estimated to come from 1,800 genomic species

Viral:
• Marine water (Breitbart et al. 2002), Mission Bay and Scripps Pier:
873 sequences for Mission Bay and 1,061 for Scripps Pier, with
more than 65% and 73% unknown, respectively
Many projects, many fragments

Three years after the marine water project, most of
its sequences are still unique, even though
GenBank has more than doubled in size.

All of the metagenome projects have generated
enormous amounts of data that still cannot be
assembled or annotated.
Bioinformatics
What bioinformatics tools are available now? The
focus is on viral metagenomics

• Assembly
• Annotation
• Diversity and structure prediction
Focus on Annotation

Tools and methods available

BLAST (Basic Local Alignment Search Tool):
the most popular tool for annotating
the fragments

Problems
• A large portion of the sequences have no
hits in the databases
• An initial BLAST may offer clues, not an annotation

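
To make the "no hits" problem concrete, here is a minimal sketch that reads BLAST tabular output and lists the query sequences with no hit at or below an E-value cutoff. It assumes the common 12-column tab-separated layout, with the query ID in the first column and the E-value in the eleventh; the function name, the cutoff, and the sample rows are made up for illustration.

```python
import csv
import io


def queries_without_hits(query_ids, tabular_text, max_evalue=1e-3):
    """Return the query IDs that have no hit with E-value <= max_evalue."""
    annotated = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue <= max_evalue:
            annotated.add(query_id)
    return [q for q in query_ids if q not in annotated]


# Invented sample: seq1 has a strong hit, seq2 only a weak one, seq3 none at all
sample = (
    "seq1\tgi|1\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t380\n"
    "seq2\tgi|2\t40.0\t60\t36\t0\t1\t60\t5\t64\t2.5\t30.1\n"
)
print(queries_without_hits(["seq1", "seq2", "seq3"], sample))
# ['seq2', 'seq3']
```

Sequences returned by such a filter are the candidates for further comparisons against other databases.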
Our Specific Application

Progressive Comparisons

Based on a criterion (an E-value threshold) selected by the
user, the input sequence can be compared against
other databases and/or “BLASTed” with
translation
Our Specific Application
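[Figure: process tree. The input sequence AACTAACTCCGCTAAACG is first
compared against DB1. At each node the result is tested against the
threshold X (e-value > X or e-value < X) to choose the next database;
the tree shown also includes DB3, DB4, DB6, DB7, DB8 and one unnumbered DB.]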
Our Specific Application (cont’d)
[Figure: the same process tree as on the previous slide]

The process tree can be as deep as the user desires
The nodes do not communicate directly with each other.

Later versions may have more intelligent clients
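
A minimal sketch of how such a process tree could be represented and walked. The assumptions are mine, not the project's: each node names one database and a threshold X, and has two optional children, one followed when the best E-value is above X and one when it is at or below X. `run_blast` is a hypothetical stand-in for whatever actually performs the search and returns the best E-value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    database: str
    threshold: float                    # user-selected E-value cutoff X
    if_above: Optional["Step"] = None   # next step when e-value > X
    if_below: Optional["Step"] = None   # next step when e-value <= X


def walk(step: Optional[Step], sequence: str,
         run_blast: Callable[[str, str], float]) -> list:
    """Follow one path through the tree, recording (database, e-value) pairs."""
    path = []
    while step is not None:
        evalue = run_blast(sequence, step.database)
        path.append((step.database, evalue))
        step = step.if_above if evalue > step.threshold else step.if_below
    return path


# A small tree; nesting Step objects makes it as deep as the user wants
tree = Step("DB1", 1e-3,
            if_above=Step("DB4", 1e-3, if_above=Step("DB6", 1e-3)),
            if_below=Step("DB3", 1e-3))

# Fake search results standing in for real BLAST runs
fake_evalues = {"DB1": 5.0, "DB3": 0.0, "DB4": 1e-20, "DB6": 1.0}
path = walk(tree, "AACTAACTCCGCTAAACG", lambda seq, db: fake_evalues[db])
print(path)
```

Because each step depends only on the sequence and the previous result, a step can run on any host that holds the named database, which fits the point that nodes need not talk to each other.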
Executing the Application

Challenges

Many sequences, many large databases
• And they are getting larger

Therefore we need to run this in parallel
on multiple hosts (each with its own local
storage):
• A bunch of workstations in a lab
• A cluster
• A bunch of PCs over the internet

Master-Worker Approach
[Figure: the master holds a client map listing the databases stored on each node]
Node 1: DB 1, 3, 4
Node 2: DB 2, 3, 5
Node 3: DB 6
Node 4: DB 1, 2, 3, 4, 5, 6

The current version has bidirectional communication
between an omniscient master and its faithful sheep
(clients/nodes).
A French graduate student in Dr. Casanova’s lab is developing
algorithms for efficient (optimal?) database distribution among the
workers
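
One way to picture the client map: the master records which databases each node stores locally and hands each job to a node that holds the database it needs. The sketch below uses the node and database layout from the slide and picks the least-loaded eligible node. That selection policy and the function name are assumptions for illustration, not the project's actual scheduling algorithm.

```python
CLIENT_MAP = {
    "Node 1": {1, 3, 4},
    "Node 2": {2, 3, 5},
    "Node 3": {6},
    "Node 4": {1, 2, 3, 4, 5, 6},
}


def assign(jobs, client_map):
    """Assign each (sequence_id, database) job to the least-loaded node holding that database."""
    load = {node: 0 for node in client_map}
    schedule = []
    for seq_id, db in jobs:
        eligible = [node for node, dbs in client_map.items() if db in dbs]
        if not eligible:
            raise ValueError(f"no node holds database {db}")
        # Break ties on node name so the result is deterministic
        node = min(eligible, key=lambda n: (load[n], n))
        load[node] += 1
        schedule.append((seq_id, db, node))
    return schedule


jobs = [("s1", 3), ("s2", 3), ("s3", 3), ("s4", 6)]
for seq_id, db, node in assign(jobs, CLIENT_MAP):
    print(seq_id, "-> DB", db, "on", node)
```

How the databases are spread across workers determines how evenly work like this can be balanced, which is the distribution question mentioned above.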
Example Process Tree
The Nitty Gritty

The project was developed in Python
• Why?

The configuration files are XML
• Standardization
• Parsing support

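
Python's standard library ships an XML parser, which is one practical reading of "parsing support". The sketch below reads a made-up configuration with `xml.etree.ElementTree`. The element and attribute names (`config`, `worker`, `database`, `host`, `port`) and the host names are hypothetical, since the project's actual schema is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: which databases each worker host stores
CONFIG = """
<config>
  <worker host="node1.example.org" port="9000">
    <database>DB1</database>
    <database>DB3</database>
  </worker>
  <worker host="node3.example.org" port="9000">
    <database>DB6</database>
  </worker>
</config>
"""


def load_workers(xml_text):
    """Return {host: {"port": int, "databases": [names]}} parsed from the config text."""
    root = ET.fromstring(xml_text)
    workers = {}
    for worker in root.findall("worker"):
        workers[worker.get("host")] = {
            "port": int(worker.get("port")),
            "databases": [db.text for db in worker.findall("database")],
        }
    return workers


workers = load_workers(CONFIG)
print(workers["node1.example.org"]["databases"])
# ['DB1', 'DB3']
```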
How Well does it Work?

It is “Working as designed” on localhost


Ready to be tested on a small cluster
Is the need met?
• Dr. Poisson’s real life example, The 600

What about MHPCC resources?
My Supervisors for This Project


Allow me to introduce Guylaine Poisson, Ph.D.,
and Henri Casanova, Ph.D.:
both outstanding mentors, U.H. faculty, and
foreign nationals
Future Application Development
• ICS 675
• GUI
• Checkpoint
• Research
• Release to the Community

What Did I Learn?
• Python
• Sockets and Network Programming
• BLAST
• Mac OS X
• Time Management

Gratitude in No Particular Order
• Dr. Poisson and Dr. Casanova
• Dr. Brown
• MHPCC
• H.R. People…
• All of You…
Questions or Comments?