Data-Intensive Research Theme Genomics Mario Caccamo

Data-Intensive Research Theme
Genomics
Mario Caccamo
mario.caccamo@bbsrc.ac.uk
Data-Intensive Research Theme - Genomics – Mario Caccamo – TGAC
• How many data sources do you handle?
• How much data do you have already?
• How much data do you expect?
• How urgent are your analyses?
• How complex is the data?
• How complex are the analyses?
• How many researchers work with this data or these methods?
• How many do you anticipate?
• What are your measures of success/satisfaction?
• How many other places/groups in your discipline have nearly the same requirements?
How many data sources ….
HiSeq 2000
75-120 base reads
~200 Gb / expt
~8 days
AB SOLiD 4 System
50 base reads
~100 Gb / expt (300Gb 4hq)
~10 days
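To put the two instruments on a common scale, the per-experiment yields and run times above can be turned into an average daily output. This is a rough sketch only: it assumes "Gb" means gigabases per experiment and that the quoted days are total run time, and the names `RUNS` and `daily_yield_gb` are made up for illustration.

```python
# Figures copied from the slide; "Gb" is assumed to mean gigabases per experiment.
RUNS = {
    "HiSeq 2000": {"yield_gb": 200, "days": 8},
    "AB SOLiD 4 System": {"yield_gb": 100, "days": 10},
}


def daily_yield_gb(yield_gb, days):
    """Average output per day of run time, in the same unit as yield_gb."""
    return yield_gb / days


for name, run in RUNS.items():
    rate = daily_yield_gb(run["yield_gb"], run["days"])
    print(f"{name}: ~{rate:.1f} Gb/day")
```

Under these assumptions a single HiSeq 2000 averages about 25 Gb per day and a SOLiD 4 about 10 Gb per day, which is the rate that storage and analysis have to keep up with for each instrument.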
How much data now….
Michael R. Stratton, Peter J. Campbell & P. Andrew Futreal
Nature 458, 719-724(9 April 2009)
The “Next Generation Gap”
• Large dataset
• Error models / quality scores
• New applications
• Skills shortage
• Hardware
We are closing the gap…
• Novel algorithms
• New standards
• Repositories
• Intelligent storage
Towards data-intensive biology…
“Sequencing Centres”
James Hadfield - CRUK
Data sharing model
[Diagram labels: Sequencing Instrument (x3), each with reads ACGTTTCCC….; Storage (x2); High-Performance Cluster; Scientist / User; Sequencing Centre; Download; Submission; Public Repository; Multi Peta-byte High-Performance storage]
How much data…
C. Southan & G. Cameron
“The Fourth Paradigm”
How much data to expect…
www.uk10k.org
How complex...
Variomics
Transcriptomics
Genomics
Phenomics
Proteomics
Metagenomics
Metatranscriptomics
De Novo Pipelines – large genomes
Generation of Sequence
Assemble contigs (N50 ~1Kb)
Extend contigs (N50 ~10Kb)
Build Scaffolds (N50 ~100Kb-1Mb)
Annotation (~35K genes)
Integration of read pair information
NGS Sequencing
Cortex/ABySS
Paired end reads 2kb, 6kb, 10kb
Integration of community resources
Transcriptomics
Expression
Diversity
BAC-ends, genetic markers, long reads (454), sequenced BACs
Curtain/Velvet
Integration
Annotation
SNP (G->A)
NNNNN
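The N50 figures quoted for each pipeline stage follow the usual definition: the length L such that contigs (or scaffolds) of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation is given here; the function name `n50` and the toy lengths are illustrative only and are not taken from any of the tools named on the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total is 20 bases, and 6 + 5 = 11 already covers half.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))  # 5
```

Because N50 is weighted towards the longest sequences, joining contigs into longer ones raises it even though the total number of assembled bases is unchanged, which is why it is used to track progress from contigs to scaffolds.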
Research community…
• Molecular Biologists
• Bioinformaticians
• Statisticians
• Sociologists, Ethicists, Policy-makers, Physicians….
Success?
DNA
Cell
…A C T T T G G G C C C C…
…T G A A A C C C G G G G…
Proteins (e.g. hemoglobin, myoglobin, myosin)
Integromics
Omics classes
Genomics
Transcriptomics
Proteomics
Metabolomics
Phenomics
Sequence
Information
Pathways (Reactome)
Structure
Data / DBMS
API
Expression
Pathways
Bioinformatics
Science
Current Model
[Diagram labels: Sequencing Instrument (x3), each with reads ACGTTTCCC….; Storage (x2); High-Performance Cluster; Scientist / User; Sequencing Centre; Download; Submission; Public Repository; Multi Peta-byte High-Performance storage]
Sequencing Node
[Diagram labels: Sequencing Instrument (x3), each with reads ACGTTTCCC….; Staging Storage; LIMS; Primary Analysis; QA; Assemblies; Multi Peta-byte High-Performance storage; Cloud infrastructure; VM test environment; Metadata; Analysis Output; User; High-Performance Cluster; Analysis Submission; Virtual Machine Pool]
Summary
• Sequencing Technologies
– Single-molecule
– Cheaper ($1000 genome)
– Faster (on-line sequencing)
• Hardware
– Connectivity
– Storage
– HPC clusters
– Cloud Computing (GrayWulf)
• Software
– Bioinformatics