Genomics and High Performance Computing
Folker Meyer
Argonne National Laboratory and University of Chicago
Brief intro:
▪ I am a computer scientist turned computational biologist
▪ My CS friends tell me I am a biologist
▪ My BIO friends tell me I am a CS person
▪ So clearly take my comments with a grain of salt ☺
▪ My research interests:
– Metagenomics to study microbial role in
• geochemical cycling (Climate, Remediation)
• human health (Human Microbiome)
▪ Strong emphasis on technology development to allow study
– Algorithms, integrations, tools, genomics tech
Argonne National
Laboratory
Why microbes?
Source: Rob Knight, U. Colorado
Biology changed. From this …
http://www.the-aps.org/education/
http://www.ferrum.edu/majors/biology.jpg
http://www.oneocean.org/ambassadors/track_a_turtle/biology
These are: “biology.png”, “biology.gif” and “biology.jpg”.
… to this. (in ~2003)
DNA Technology advances … FAST
Year    Technology        Read length (bp)   Mbp/hr     Mbp/run           Cost/run
1977    Sanger            ~100               ~0.0001    ~0.001            ?
1987    ABI 370           ~500               0.01       0.05              ~$1,000
~2001   ABI 3700          ~1000              0.1        0.5               $200
2005    454 pyroseq       120-450            25-100     500               $13,000
2007    Solexa            50-125 (1)         ~1000      50,000-200,000    $15,000
2007    ABI Solid         35-50              75-250     6,000-20,000      ~$5,000
2010    Ion Torrent (2)   ~200               100        50                $500
2010?   Helicos           25-50              ~50        ?                 $20,000
2010?   PacBio            700                120        30                $100
(1) Solexa paired-end now 200-220bp
(2) Machine cost ~$50k
Ongoing change
1) Democratization of data generation
From factory to bench-top, and
>70% of Illumina machines go to “small” customers (*)
(*) From Illumina at the 2010 GIA meeting
2) Data set size
▪ Instrument output went from <<1 GB to 200 GB in <5 years
▪ 24-48 month trend looks … interesting
▪ To put this into perspective:
– All genomics knowledge (“RefSeq”) was 57GB when I last checked
– “Large” Global Ocean Sampling study by Craig Venter in 2004 was 600 megabases
▪ When biologists speak of big data, it frequently fits on an iPod
3) Computing costs dominate
[Figure: cost in $ of sequencing vs. bioinformatics by data set size in GB: 0.5 and 1 (454); 30, 60 and 98 (GAIIx); 196 and 294 (HiSeq2000)]
• 95GB == 195,600 node hours (on Nehalem 8-core, 16GB)
• Illumina HiSeq2000 = 2x100GB/run available today
• Cost is purely BLASTX (no storage or transfer cost) on Amazon EC2
• Note: 10x or 100x improvements over BLASTX will help, but not solve the problem
Source: Wilkening et al., Proceedings IEEE Cluster09, 2009
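The node-hour figure on this slide can be turned into a rough scaling estimate. The sketch below assumes, purely for illustration, that BLASTX cost grows linearly with data size and that a cluster scales perfectly; the 1,000-node cluster and the function names are hypothetical and not taken from the talk.

```python
# Scaling sketch based on the slide's figure: 95 GB == 195,600 node hours.
# Assumption (not from the source): compute grows linearly with data size.

NODE_HOURS_MEASURED = 195_600   # BLASTX node hours quoted on the slide
GB_MEASURED = 95                # data size those node hours correspond to


def node_hours(gb: float, speedup: float = 1.0) -> float:
    """Estimate node hours for `gb` of sequence given an algorithmic speedup."""
    return NODE_HOURS_MEASURED / GB_MEASURED * gb / speedup


def days_on_cluster(gb: float, nodes: int, speedup: float = 1.0) -> float:
    """Wall-clock days on a hypothetical cluster with perfect parallel scaling."""
    return node_hours(gb, speedup) / nodes / 24


if __name__ == "__main__":
    run_gb = 200  # one HiSeq2000 run, 2 x 100 GB per the slide
    for speedup in (1, 10, 100):
        print(f"{speedup:>3}x: {node_hours(run_gb, speedup):>10,.0f} node hours, "
              f"{days_on_cluster(run_gb, 1000, speedup):6.2f} days on 1,000 nodes")
```

Under these assumptions a single 200 GB run needs roughly 412,000 node hours with plain BLASTX, and a 100x faster code still needs about 4,100 node hours per run, which is the sense in which faster algorithms help but do not solve the problem.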
What does genomics look like?
Example: Analysis of pathogen: B. subtilis
Sequencing a genome
Where are we?
Closing gaps between contigs
validating the sequence via a map
Genome sequencing is now routine
Figures: © Nikos Kyrpides DOE Joint Genome Institute, G.O.L.D.
Genome sequencing: a success story
▪ Genome assembly used to require large machines
▪ and significant manual effort
▪ More data and novel codes make this much easier
– Velvet, Newbler, AllPaths, …
▪ Bacterial genome sequencing and assembly sometimes in a day
▪ While some quality issues still exist
▪ complete automation on low-cost machines is likely to happen in the next 12-24 months
A genome:
…CGGGGGAGCCCTCCAGAATACCCATCA TATAG CCCCT GAGGT GGCAT GGGA TGTCT CCATG AGGGA ACCCC TTCC CACTT CATAC TGTC
ACGTATA TCATA GTGT TC TTGACTG GGCCA TTCA TC TAAGATG GGATT TACC CT GTGAAAC AGGGA GAAG AC TTATGGA CCCCA AGCA TC AT
TTCAAGT TGAAG TTGA GT TTTTAAA AGCCA TCCA TG CAAAGTT CCTTT GCTT TG GACCCTC TGCAT TATT AA AGCTGCT GTATT GCTA AC CC
AGAACTG CTCCA GTGT CT TGACTGA TCATC ATGG CT TCAGTTT GGAAG AGAC TG CAGCGTG TGGGA AAAC AT GCATCCA AGTTC CAGT TT GT
GGCCTCC TACCA GGAG CT CATGGTT GAGTG TACG AA GAAATGG TAACC AGAT AA ACTGGTG GTAGA TGAA GA CATGCAA AGTTT GGCT AG TT
TGGTGAG TATGA AGCA GG CTGACAT TGGCA ATTT AG ATGACTT CGAAG AAGA TA ATGAAGA TGATG ATGA GA ACAGAGT GAACC AAGA AG AA
AAGGCAG CTAAA ATTA CA GAGCTTA TCAAC AAAC TT AACTTTT TGGAT GAAG CA GAAAAGG ACTTG GCCA CC GTGAATT CAAAT CCAT TT GA
TGATCCT GATGC TGCA GA ATTAAAT CCATT TGGA GA TCCTGAC TCAGA AGAA CC TATCACT GAAAC AGCT TC ACCTAGA AAAAC AGAA GA CT
CTTTTTA TAATA ACAG CT ATAATCC CTTTA AAGA GG TGCAGAC TCCAC AGTA TT TGAACCC ATTCG ATGA GC CAGAAGC ATTTG TGAC CA TA
AAGGATT CTCCT CCCC AG TCTACAA AAAGA AAAA AT ATAAGAC CTGTG GATA TG AGCAAGT ACCTC TATG CT GATAGTT CTAAA ACTG AA GC
AGAGCTT AGTGA TCTG AA GCGGGAG CCTGA ACTA CA ACAGCCT ATCAG CGGA GC GTGACAG GTACG TGAT GC TAGCTTT TATCA GGCA GC GG
TATGCGC GATCA ATGC GC GCGGCTA TATGA TCTG CA TGCGGCG CGATT ACTC TT CGGAGCT TATTT CTGC GG CGGGCCG GGGGA GCCC TC CA
GAATACC CATCA TATA GC CCCTGAG GTGGC ATGG GA TGTCTCC ATGAG GGAA CC CCTTCCC ACTTC ATAC TG TCACGTA TATCA TAGT GT TC
TTGACTG GGCCA TTCA TC TAAGATG GGATT TACC CT GTGAAAC AGGGA GAAG AC TTATGGA CCCCA AGCA TC ATTTCAA GTTGA AGTT GA GT
TTTTAAA AGCCA TCCA TG CAAAGTT CCTTT GCTT TG GACCCTC TGCAT TATT AA AGCTGCT GTATT GCTA AC CCAGAAC TGCTC CAGT GT CT
TGACTGA TCATC ATGG CT TCAGTTT GGAAG AGAC TG CAGCGTG TGGGA AAAC AT GCATCCA AGTTC CAGT TT GTGGCCT CCTAC CAGG AG CT
CATGGTT GAGTG TACG AA GAAATGG TAACC AGAT AA ACTGGTG GTAGA TGAA GA CATGCAA AGTTT GGCT AG TTTGGTG AGTAT GAAG CA GG
CTGACAT TGGCA ATTT AG ATGACTT CGAAG AAGA TA ATGAAGA TGATG ATGA GA ACAGAGT GAACC AAGA AG AAAAGGC AGCTA AAAT TA CA
GAGCTTA TCAAC AAAC TT AACTTTT TGGAT GAAG CA GAAAAGG ACTTG GCCA CC GTGAATT CAAAT CCAT TT GATGATC CTGAT GCTG CA GA
ATTAAAT CCATT TGGA GA TCCTGAC TCAGA AGAA CC TATCACT GAAAC AGCT TC ACCTAGA AAAAC AGAA GA CTCTTTT TATAA TAAC AG CT
ATAATCC CTTTA AAGA GG TGCAGAC TCCAC AGTA TT TGAACCC ATTCG ATGA GC CAGAAGC ATTTG TGAC CA TAAAGGA TTCTC CTCC CC AG
TCTACAA AAAGA AAAA AT ATAAGAC CTGTG GATA TG AGCAAGT ACCTC TATG CT GATAGTT CTAAA ACTG AA GCAGAGC TTAGT GATC TG AA
GCGGGAG CCTGA ACTA CA ACAGCCT ATCAG CGGA GC GTGACAG GTACG TGAT GC TAGCTTT TATCA GGCA GC GGTATGC GCGAT CAAT GC GC
GCG
…
From genome to information: Annotation
Bioinformatics
Source: A. Becker, U of Freiburg, Germany
Annotation …
▪ In the old days:
– Find every possible gene
– Run every tool known to mankind
• BLAST, HMMer, …
– Against every known database
• RefSeq, PFAM, InterPro, KEGG, COG, …
– Have humans interpret the results
HTGA
▪ Several drawbacks:
– Computationally expensive → fixable with $$
– Requires lots of FTEs → fixable with $$$$
– Subjective factors come into play → fixable with standards?
• Still an open debate ☺
Resulting compute requirements
▪ bacterial genomes contain ~1000 genes per Megabase
▪ BLAST vs NCBI NR search takes >10 min per gene = 10,000 min
▪ annotations often require EC (Enzyme) numbers + 10,000 min
▪ protein domains (Pfam) help gain confidence + 50,000 min
▪ genomic context comparison (“Clusters”) + 200,000 min
▪ …
“Clusters” and “Pfam” have highest confidence
BLAST viewed as error prone
⇨ CPU investment and quality are correlated
⇨ more computing helps
⇨ most groups cannot pay the price
Source: Informal survey of ~20 manual annotators
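To make the arithmetic on this slide explicit, here is a small sketch that adds up the per-step CPU minutes. It assumes that each of the four minute figures (10,000, 10,000, 50,000 and 200,000) belongs to the step listed beside it and that all of them are per megabase of genome; both pairings are my reading of the slide, and the function name is hypothetical.

```python
# CPU minutes per megabase of bacterial genome, as read off the slide.
# Assumption: each minute figure pairs with the step listed beside it.
GENES_PER_MBP = 1000
BLAST_MIN_PER_GENE = 10

STEP_MINUTES = {
    "BLAST vs NCBI NR": GENES_PER_MBP * BLAST_MIN_PER_GENE,  # 10,000 min
    "EC numbers": 10_000,
    "Pfam domains": 50_000,
    "genomic context (Clusters)": 200_000,
}


def total_cpu_days(genome_mbp: float) -> float:
    """Total CPU days for a genome of `genome_mbp` megabases, all steps."""
    total_minutes = sum(STEP_MINUTES.values()) * genome_mbp
    return total_minutes / 60 / 24


if __name__ == "__main__":
    for size in (1, 4):
        print(f"{size} Mbp genome: {total_cpu_days(size):.1f} CPU days")
```

Under these assumptions one megabase costs 270,000 CPU minutes, or 187.5 CPU days, with the high-confidence steps ("Clusters" and "Pfam") accounting for most of it.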
Things change ….
▪ More sequences are being annotated
▪ Database grows
▪ Human annotator expertise shrinks relatively
[Figure: relative annotator expertise (100%, 50%, 1%) and DNA space over time, through the Sanger, WGS and pyrosequencing eras]
Ecosystem is unsustainable
▪ As sequencing becomes so cheap
▪ Analysis is the bottleneck
▪ Community needs a scalable solution
▪ Many view standards as the solution:
– It is unclear to me how standards alone can help this problem
▪ As we get more data, computes take longer and everything becomes more complex
Annotation: Another success story
▪ Many expensive solutions exist
▪ But there is also a novel approach:
▪ RAST server (Aziz et al., BMC Genomics, 2008)
– Team of CS and Bio experts developed novel approach
• Subsystem technology
– Combining domain knowledge from both areas
• Integrate data curation and annotation using novel approaches
• requiring far fewer resources, better accuracy
▪ Annotate several genomes in a day on a laptop
▪ Server has processed over 12k genomes since 2008
– Note: Works only for bacteria
– Extension to other areas possible
– Try it: http://rast.nmpdr.org
The future ..
▪ Bacterial genomics has become easy
▪ Larger genomes remain harder
– But plummeting sequencing cost will help
▪ “Traditional” genomics
– Sequencer output is “compressed” to a few contigs
– Via “assembly” to a fraction of the size
– Human genome only has 20,000 genes
▪ Imagine what would happen w/o assembly
▪ Every sequence a “gene”
The next big thing: Metagenomics
Cost per base
Source: Rob Knight, U. Colorado
Metagenomics needs the magic wand..
== “shotgun genomics applied directly to various environments”
– → “shotgun metagenomics”
!= sequencing of BAC clones with env. DNA
– → “functional metagenomics”
!= sequencing single genes (16S rDNA)
– → “gene surveys”
What are they doing?
Who are they?
data
Community Structure and Metabolism
Gene W. Tyson, Jarrod Chapman, Philip Hugenholtz, Eric E. Allen, Rachna J. Ram, Paul M. Richardson, Victor V. Solovyev, Edward M. Rubin, Daniel S. Rokhsar & Jillian F. Banfield
Nature, vol. 428, 4 March 2004, www.nature.com/nature
Size of flagship projects …
Date   Metagenome           Size       Type
2004   Acid Mine Drainage   76 Mbp     Sanger
2004   GOS-I                700 Mbp    Sanger
2009   ANL-Subsurface       12 Gbp     Solexa 2x75PE
2009   JGI-Cow-rumen        17 Gbp     Solexa
2010   JGI-Cow-rumen        255 Gbp    Solexa
2010   MetaHIT              500 Gbp    Solexa
2010   DeepSoil             70 Gbp     Solexa 180bp (125x2)
2010   HMP                  5.7 Tbp    Solexa 100x2
Metagenome tasks
▪ Metabolic functions
▪ Community members
▪ Who is doing what
“binning” genes to genomes
Sync… We are here:
▪ 8492 metagenomes from > 500 groups
▪ Over 10GB per week (rapid growth)
▪ Many centers produce data
MG-RAST – metagenomics RAST server
Data growing over time
• open access, web based
• users upload data sets
• >3500 data submitters
Web UI
Scaling up MG-RAST v2
▪ ~3,500 users (data submitters)
▪ ~200 daily users (>10 minutes)
▪ V2.0 was a typical bioinformatics app (→ next slide)
▪ Throughput was becoming a major problem, approaching 1 GB per week in late 2008
⇨ Need a mechanism allowing more throughput
Technology choices (typical for BIO)
▪ Tightly integrated system
– Pleasantly parallel code
– NFS for data movement
– Central database server
– Workflow management via SGE
▪ Running on ~50 machines locally
40 node Dual-PPC-Cluster
shares NFS filesystem with all systems
Performance analysis MG-RAST
[Figure: run time in hours for small jobs and large jobs (avg. ~0.1Gb); annotated “2x”]
▪ A snapshot with little wait time
▪ Most time is spent in SIMS
▪ Short jobs spend a long time in create
• Careful analysis of all computations
• IO to CPU ratio determines suitability of platforms
Redesign workflow
▪ Enable use of distributed computing platforms
– Including e.g. BG/P, EC2, Azure and local clusters
▪ Enable users to contribute resources
▪ Be robust, scalable, fault tolerant
▪ Enable replacement of algorithms with more efficient ones
▪ Enable support for staged database updates
▪ Built a prototype workflow engine
– Argonne Workflow Environment (A.W.E.)
A.W.E.
[Figure: A.W.E. architecture with steps A, B and C. An AWE server (web server, db; Facebook’s Tornado + SQLAlchemy) exchanges work requests, fna work units and results with diverse AWE clients over a RESTful interface]
• Single analysis operation results in a series of work units
• Client requests a lease on work, with a timeout
• Results reported to the server, with failures resulting in lease expiration
• REST interface scales well and minimizes prerequisites
• Client can size requests to local computational capacity
• Tested up to ~500 clients
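The lease idea in these bullets can be illustrated with a minimal in-memory sketch. This is not the A.W.E. implementation: the class and method names (`LeaseQueue`, `checkout`, `report`) are hypothetical, there is no REST layer or database, and the injected clock exists only to make the behaviour easy to test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class WorkUnit:
    unit_id: int
    payload: str                      # e.g. name of one chunk of sequence data
    state: str = "pending"            # pending -> leased -> done
    holder: Optional[str] = None
    lease_expires: float = 0.0


class LeaseQueue:
    """In-memory stand-in for a server that leases work units to clients."""

    def __init__(self, payloads: List[str], lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.units: Dict[int, WorkUnit] = {
            i: WorkUnit(i, p) for i, p in enumerate(payloads)
        }
        self.lease_seconds = lease_seconds
        self.clock = clock

    def _expire(self) -> None:
        # A lease that timed out puts the unit back into the pending pool.
        now = self.clock()
        for u in self.units.values():
            if u.state == "leased" and u.lease_expires <= now:
                u.state, u.holder = "pending", None

    def checkout(self, client: str, max_units: int) -> List[WorkUnit]:
        """Lease up to `max_units` pending units; the client picks the size."""
        self._expire()
        leased: List[WorkUnit] = []
        for u in self.units.values():
            if len(leased) >= max_units:
                break
            if u.state == "pending":
                u.state, u.holder = "leased", client
                u.lease_expires = self.clock() + self.lease_seconds
                leased.append(u)
        return leased

    def report(self, client: str, unit_id: int, ok: bool) -> bool:
        """Record a result. A failure ends the lease so the unit is re-queued."""
        self._expire()
        u = self.units[unit_id]
        if u.state != "leased" or u.holder != client:
            return False              # lease expired or held by another client
        if ok:
            u.state = "done"
        else:
            u.state, u.holder = "pending", None
        return True

    def finished(self) -> bool:
        return all(u.state == "done" for u in self.units.values())
```

A client with more local capacity simply passes a larger `max_units`; a client that crashes never reports, so its units return to the pool once the lease times out.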
What will the future bring?
I lost my crystal ball
▪ BUT:
▪ A lot more data seems a safe bet
▪ A lot more computing is certain
▪ The computing will be non-coupled codes (pleasantly parallel)
▪ I hope for better algorithms ☺
▪ More standards … ah best-practice
M5 -- a metagenomics data sharing infrastructure for a democratized sequencing world
M5 = metagenomics, metadata, meta-analysis, models, and metainfrastructure
M5 – The initial goal
▪ Compute similarities once against one non-redundant database that covers:
– GenBank, RefSeq, KEGG, UniProt, SEED, IMG, eggNOG
▪ One computation can be used for all tools
▪ Store and transport similarities to avoid recomputing
▪ Proposed MTF
– Metagenome Transport Format
– v0.1
▪ Benefit: compute once, use in ALL tools
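One way to picture "compute once, use in ALL tools" is a non-redundant store in which identical sequences from different source databases collapse to a single entry, so that one similarity hit expands to every source annotation. The sketch below is an illustration only: keying sequences by an MD5 digest is an assumption made here, the accessions are made up, and nothing about the proposed MTF format itself is modelled because the slide gives no details.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def seq_key(protein: str) -> str:
    """Hypothetical identifier: MD5 digest of the normalized sequence."""
    return hashlib.md5(protein.strip().upper().encode("ascii")).hexdigest()


class NonRedundantDB:
    """Stores each distinct sequence once and remembers all of its sources."""

    def __init__(self) -> None:
        self.sequences: Dict[str, str] = {}
        self.sources: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, source_db: str, accession: str, protein: str) -> str:
        key = seq_key(protein)
        self.sequences.setdefault(key, protein.strip().upper())
        self.sources[key].append((source_db, accession))
        return key

    def annotate(self, hit_key: str) -> List[Tuple[str, str]]:
        """Expand one similarity hit to every source database entry."""
        return list(self.sources.get(hit_key, []))


if __name__ == "__main__":
    db = NonRedundantDB()
    db.add("RefSeq", "example_r1", "MKVLAAGIV")   # made-up accessions
    db.add("KEGG", "example_k1", "mkvlaagiv")     # same protein, other source
    db.add("SEED", "example_s1", "MSTNPKPQR")
    print(len(db.sequences), "distinct sequences")
    print(db.annotate(seq_key("MKVLAAGIV")))
```

A search then runs against the distinct sequences only, and each tool looks up the sources it cares about from the stored hit instead of recomputing the similarity.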
M5 – the long term goal
Establish community best practice
It will also lead to the ability to outsource some computing tasks for the community
Maintaining large metagenomic data sets will overwhelm all bio computing centers I know
Searches against published metagenomes will become impossible if we don’t establish a curated body of metagenomes
(A part of) the proposed M5 Platform
[Figure: platform components: HPC (e.g. OLCF, ALCF, TeraGRID, OSG, …); CLOUD; TeraGRID; define reference data set; export MTF; SOP/Workflow repository; Standards in Genomics; IMG; MG-RAST; CAMERA; Xyz..]
USERS (large scale)
• Environmental metagenomics: GOS, Terragenome
• Microbiota in human health: HMP
Many smaller user groups
Summary
Overall genomics is not limited by lack of cycles
Lack of good codes and best practice is more limiting
Adjusting to “large data”
Will become important for HPC community to recognize good use of cycles
And help avoid “stupid computes”
Remember ☺
Bioinformatics
Interesting novel questions
▪ Which microbes are where on the planet
– Microbial weather
▪ Which genes are where
– Gene migration patterns
– Combinations of genes
– Where do pathways or clusters travel
▪ Integration of climate data
▪ Predictive models
– How will X impact the microbes
Evolution of computing infrastructure for BIO
[Figure: abundance of machines across three eras: before genomics, early genomics, genomics era (2010)]
Why should I care?
▪ Microbes determine the climate on the planet! (e.g. → Falkowski et al., Science 2008)
▪ Microbes impact human health (e.g. obesity → Turnbaugh et al., Nature 2006)
▪ Computations are pleasantly parallel, but there are a lot of them
▪ Example: Oil spill, integrate gene inventory data with oil spill patterns