“Big Data” Bioinformatics Mario Caccamo (TGAC)

Understanding Variation
high yield
low yield
…A C T T T G G G C C C C…
…A C T T T G A G C C C C…
Understanding Variation
resistant
susceptible
…A C T T T G G G C C C C…
…A C T T T G A G C C C C…
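The two sequences on these slides differ at a single base. As a minimal sketch of how such a difference could be located, assuming the sequences are already aligned and of equal length (the function name `find_snps` is illustrative, not from the talk):

```python
def find_snps(ref: str, alt: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, ref base, alt base) for every mismatch.

    Assumes both sequences are already aligned and have equal length.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# The two fragments shown on the slide, with the spaces removed.
seq_a = "ACTTTGGGCCCC"
seq_b = "ACTTTGAGCCCC"
print(find_snps(seq_a, seq_b))
```

Real variant calling works from many noisy reads rather than two clean strings, so this only illustrates the idea of a single-base difference.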
Genetic Variation
‘Omics
Variomics
Transcriptomics
Genomics
Phenomics
Proteomics
Metagenomics
Metatranscriptomics
Integromics
Omics classes
Genomics
Transcriptomics
Proteomics
Metabolomics
Phenomics
Sequence
Information
Pathways (Reactome)
Structure
API
Data
/ DBMS
API
Expression
Pathways
Bioinformatics
Portal/Publication
Assembly
Genome
Genome is fragmented.
Sequence fragments - traces.
ACCTG..
CCTTG..
Assemble fragments - contigs.
Link contigs into scaffolds.
NNNNN
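The steps on this slide (fragments, contigs, scaffolds) can be sketched in a few lines. This is a toy illustration under simplifying assumptions: exact suffix/prefix overlaps with no sequencing errors, and gap sizes that are already known. The helper names and the example sequences are made up for the sketch.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Merge two overlapping fragments into one contig, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


def scaffold(contigs: list[str], gap_sizes: list[int]) -> str:
    """Join ordered contigs, filling each gap of estimated size with a run of N."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)


contig = merge("ACCTGA", "TGACCT")          # fragments sharing the overlap "TGA"
print(contig)
print(scaffold(["ACCTG", "CCTTG"], [5]))    # two contigs linked across a 5-base gap
```

In practice the order, orientation and spacing of contigs come from paired-end or mate-pair information; here they are simply given.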
5x
Repeats
A1
A1
A2
A3
>90%
DNA
A2
De Bruijn Graphs - Contigs
CAACAA
A
CAACAA
A
GCAACA
A
CGCAA
CA
AACTAACGAC G CGCA T CAAAA
ACTAACGAC T CGCA T CAAAA
ACTAACGAC G CGCA A CAAAA
GCGCA
AC
1x
CGCGC
AA
AACTAA
C
1x
ACTAAC
G
3x
CTAACG
A
TAACGA
C
3x
3x
2x
2x
AACGAC
G
ACGAC
GC
2x
CGACG
CG
2x
GACGC
GC
2x
ACGCG
CA
1x
CGCGC
AT
GCGCAT
C
CGCATC
A
TCGCAT
C
AACGAC
T
CTCGCA
T
ACGACT
C
CGACTC
G
GACTCG
C
ACTCGC
A
GCATCA
A
2x
CATCAA
A
2x
ATCAAA
A
2x
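The node labels and multiplicities on this slide can be reproduced with a short sketch. It assumes the three reads are the ones printed on the slide (spaces removed) and that k = 7, which is inferred from the 6-base labels plus one extending base, not stated in the talk. The function names are illustrative.

```python
from collections import Counter, defaultdict


def kmer_counts(reads: list[str], k: int) -> Counter:
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn(counts: Counter) -> dict:
    """Map each (k-1)-mer node to its successor nodes and the edge multiplicity."""
    graph = defaultdict(dict)
    for kmer, n in counts.items():
        graph[kmer[:-1]][kmer[1:]] = n
    return graph


# Reads transcribed from the slide; they differ at two positions.
reads = [
    "AACTAACGACGCGCATCAAAA",
    "ACTAACGACTCGCATCAAAA",
    "ACTAACGACGCGCAACAAAA",
]
counts = kmer_counts(reads, k=7)
graph = de_bruijn(counts)

# A node with more than one successor marks a branch in the graph.
for node, successors in graph.items():
    if len(successors) > 1:
        print(node, successors)
```

The branch at `AACGAC` (followed by G in two reads and by T in one) is the kind of bubble an assembler has to resolve as either a sequencing error or a real variant.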
$$P(d > 0) = 1 - e^{-N(L-K)/G}, \qquad L > K$$
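A small numerical sketch of this expression, reading it as P(d > 0) = 1 - exp(-N(L - K)/G). That reading is an assumption, since the operators were lost in the slide text, as is the interpretation of the symbols: N reads of length L, k-mer length K, genome size G. The example numbers are hypothetical apart from the approximate E. coli genome size.

```python
import math


def prob_kmer_observed(n_reads: int, read_len: int, k: int, genome_size: int) -> float:
    """Probability that a given k-mer is seen at least once, P(d > 0).

    Assumes reads are placed uniformly at random (a Poisson model).
    """
    if read_len <= k:
        raise ValueError("read length L must be greater than K")
    expected_depth = n_reads * (read_len - k) / genome_size
    return 1.0 - math.exp(-expected_depth)


# Hypothetical run: 36-base reads, K = 21, on a ~4.6 Mb genome (roughly E. coli).
for n_reads in (100_000, 1_000_000, 5_000_000):
    p = prob_kmer_observed(n_reads, read_len=36, k=21, genome_size=4_600_000)
    print(f"N = {n_reads:>9,}  P(d > 0) = {p:.4f}")
```

The probability rises towards 1 as the number of reads grows, and for a fixed number of reads it falls as K approaches L.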
E. coli Reference
Sequencing Technologies
Michael R. Stratton, Peter J. Campbell & P. Andrew Futreal
Nature 458, 719-724(9 April 2009)
Heavy Weights
HiSeq 2000
75-120 base reads
~200 Gb / expt
~8 days
AB SOLiD 4 System
50 base reads
~100 Gb / expt (300 Gb with the 4hq)
~10 days
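The yield and run-time figures above translate into a daily throughput with simple arithmetic. A rough sketch follows, using only the approximate numbers on the slide; the 3 Gb genome used for the fold-coverage line is a hypothetical example, not a figure from the talk.

```python
# Approximate figures from the slide: (yield in Gb per experiment, run time in days).
platforms = {
    "HiSeq 2000": (200, 8),
    "AB SOLiD 4 System": (100, 10),
}


def gb_per_day(yield_gb: float, run_days: float) -> float:
    """Average sequence output per day of instrument time."""
    return yield_gb / run_days


def fold_coverage(yield_gb: float, genome_gb: float) -> float:
    """Average depth if the whole yield came from one genome."""
    return yield_gb / genome_gb


HYPOTHETICAL_GENOME_GB = 3.0  # assumption for illustration only

for name, (yield_gb, run_days) in platforms.items():
    rate = gb_per_day(yield_gb, run_days)
    depth = fold_coverage(yield_gb, HYPOTHETICAL_GENOME_GB)
    print(f"{name}: {rate:.1f} Gb/day, ~{depth:.0f}x of a {HYPOTHETICAL_GENOME_GB:.0f} Gb genome per run")
```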
“Sequencing Centres”
James Hadfield - CRUK
We want more….
• “Genome Zoo” – 10K vertebrates
• 1000 plants
• 1001 Arabidopsis
• Cancer genomes
• Single cell?
Single-Molecule Sequencing
Single molecule: primer immobilized
DNA sequence detection as molecules pass through a nm-sized pore
Single molecule sequencing by enzyme tethered in 20 nm hole
Bioinformatics
Bioinformatics
Data
Light Weights
SOLiD PI System
GS Junior
iPod dock
Ion Torrent
iSCAN
Embracing Complexity
“Solving the puzzle of complex diseases, from obesity to cancer, will require
holistic understanding of the interplay between factors such as genetics,
diet, infectious agents, environment, behavior, and social structures.”
Elias Zerhouni, The NIH Roadmap, Science 2003, 302:63-64.
“Our brains are wired for narrative, not statistical
uncertainty” – Francis Bacon (I think)
The Genome Analysis Centre
A new facility to provide critical mass and excellence in genomics specialised
in animal, microbial and plant research:
- high throughput sequencing
- new technology platforms
- bioinformatics
- impact through innovation and enterprise
TGAC will complement work being carried out at The Wellcome Trust Sanger
Institute and MRC / NERC Centres.
Norwich
Norwich Research Park
University
Hospital
UEA
JIC
TSL
TGAC
IFR
www.tgac.bbsrc.ac.uk
Next Generation Platforms
Pyrosequencing: > 400 bases / read, ~ 500 Mb / expt
Sequencing by synthesis: > 75 base paired-end reads, > 30 Gb / expt
Sequencing by ligation: 50 base paired-end reads, > 30 Gb / expt
Data Analysis Pipeline
TGAC
Sequencing
Instrument
Sequencing
Instrument
Genome Technology Division
Sequencing Informatics
Sequencing
Instrument
Primary Analysis
Web Interface
LIMS
Genome Analysis
Computational Genomics
Internal Users
External Users
Annotated
High quality data
Public
Repositories
Repositories
“Submitter decides”
Community
Input
EBI
Genome
Archive
EMBL
Archive
Sequencing
Centre
Trace
Archive
Annotated
genomes
“Community decides”
Sequencing
Centre
Genome Browsers
DAS
genome browser
local storage
reference sequence
DAS client
XML
DAS server
DAS server
DAS server
remote storage
remote storage
remote storage
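In the DAS model sketched on this slide, a client asks several servers for annotation and receives XML that the browser overlays on the reference sequence. Below is a minimal sketch of the client-side parsing step. The XML is hand-written and simplified, modelled loosely on a DAS features response, so the element names should be checked against the DAS specification. The feature shown is invented for illustration and no real server is contacted.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified example of what a DAS server might return (an assumption).
SAMPLE_RESPONSE = """<DASGFF>
  <GFF>
    <SEGMENT id="chr1" start="1" stop="1000">
      <FEATURE id="snp_1" label="G/A variant">
        <TYPE id="SNP">SNP</TYPE>
        <START>507</START>
        <END>507</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text: str) -> list[dict]:
    """Extract segment, id, type and coordinates for each feature in the response."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.findall("FEATURE"):
            type_el = feature.find("TYPE")
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": type_el.get("id") if type_el is not None else None,
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


for feature in parse_features(SAMPLE_RESPONSE):
    print(feature)
```

A browser would merge the feature lists returned by each server into separate tracks drawn against the same reference coordinates.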
Browsers
DAS server
Visualisation
API
Data
Integration
Programmatic Access
Data mining
Data Sharing
DAS tracks
Current Model
Sequencing
Instrument
ACGTTTCCC….
Sequencing
Instrument
ACGTTTCCC….
Storage
High-Performance
Cluster
Sequencing
Instrument
ACGTTTCCC….
Scientist / User
Sequencing Centre
Download
Submission
Multi-petabyte
High-Performance storage
Public Repository
Storage
EMBL Repositories
C. Southan & G. Cameron
“The Fourth Paradigm”
Sequencing Centre
Sequencing
Instrument
Sequencing
Instrument
ACGTTTCCC….
Sequencing
Instrument
ACGTTTCCC….
LIMS
Staging Storage
Primary Analysis
QA
Assemblies
Multi-petabyte
High-Performance storage
VM test
environment
ACGTTTCCC….
Metadata
Analysis Output
User
High-Performance
Cluster
Analysis Submission
Virtual
Machine
Pool
Cloud infrastructure
Summary
• Sequencing Technologies
– Single-molecule
– Cheaper ($1000 genome)
– Faster (real-time sequencing)
• Hardware
– Connectivity
– Storage
– HPC clusters
– Cloud Computing (GrayWulf)
• Software
– Bioinformatics
– Staff
– New paradigms
– Data Integration & Dissemination
– Data visualisation
Thanks!