3rd Gen DNA Sequencing

3rd Essential Practical
Bioinformatics Workshop
Introductory Lecture:
Why Bioinformatics?
Tan Tin Wee
Department of Biochemistry, YLL School of Medicine, NUS
and
Victor Tong Joo Chuan
I2R, Department of Biochemistry, YLL School of Medicine (Adjunct), NUS
Mohammad Asif Khan
Perdana University Graduate School of Medicine, Malaysia
Some definitions of Bioinformatics
Bioinformatics is “the development and
application of techniques from computer
science, mathematics and statistics to
address biological problems”
Dubitzky (Brief Bioinform. 2009; 10: 343)
Bioinformatics is “the study of the
information content and information flow
in biological systems and processes”.
Michael Liebman in “Bioinformatics: An Editorial Perspective”
(http://www.netsci.org/Science/Bioinform/feature01.html)
Information flow in
Biology
How does the plane fly from A to B?
- A 747-400 has six million parts.
- A 747-400 has 274 km of wiring and 8 km of tubing.
- Seventy-five thousand engineering drawings were used to produce the first 747.
- Would we be able to understand how a pilot flies or lands the plane, i.e. the behaviour of the plane, if we took it all apart?
Can I understand how this PowerPoint presentation works if I take my computer apart and analyse every chip and every transistor?
What drives us to study the Life Sciences?
Unraveling the Mysteries of Life!
SCALE
Time
Space
Image: http://instruct1.cit.cornell.edu/courses/bioes278_lecture/Topic1Basic_Concepts/07-Levels_of_organisation.jpg (Accessed Aug 16, 2006)
Why Bioinformatics?
Bioinformatics is at the beginning of taking Biology to the next level:
- Studying living things with information technology and with computing systems, and
- Thinking about living systems with information theory.
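One way to make "thinking with information theory" concrete is to measure the Shannon entropy of a sequence in bits per symbol. The sketch below is illustrative only: the function name and the two toy sequences are assumptions made for this example, not part of the lecture.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Return the Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy sequences invented for the demo.
uniform = shannon_entropy("ACGT" * 5)   # four bases, equally frequent
biased = shannon_entropy("AAAAAAAT")    # dominated by one base
```

With four equally frequent bases the entropy reaches its maximum of 2 bits per base; a strongly biased composition carries less information per position.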
Unraveling the Mysteries of Life!
With Information Theory?
Soul? Quantum Information?
Mind and Consciousness
Brain
Network of Neurons
Epigenetic Code
DNA Genetic Information
http://www.labgrab.com/users/triptangent/blog/laser-induced-gamma-wavesoffer-insight-brain-functions
DEPTH
Living Things and
Life Sciences are special!
Information
Energy
Matter
Information flow in
Biology
Central Dogma and the “omics”
DNA → RNA → Protein
- DNA: Genomics
- RNA: Transcriptomics
- Protein: Proteomics
- Regulation: Interactomics
- Metabolism: Metabolomics
- Degradation/degradomics, Immunity/immunomics, etc.
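The DNA → RNA → Protein flow can be sketched in a few lines. This is a minimal illustration, assuming a coding-strand input and a deliberately tiny codon table (only the four codons used in the demo); the function names and the example sequence are made up for this sketch.

```python
# Only the codons needed for the demo; a real table has 64 entries.
CODON_TABLE = {
    "AUG": "M",  # start codon, methionine
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}

def transcribe(coding_dna: str) -> str:
    """DNA coding strand -> mRNA: replace T with U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCTTGGTAA"          # toy sequence invented for the demo
mrna = transcribe(dna)
protein = translate(mrna)
```

Here `mrna` is `AUGGCUUGGUAA` and `protein` is `MAW`; each "omics" layer studies one of these three levels across the whole cell.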
Structure of DNA
JD Watson and FHC Crick
April 25, 1953
Nature 171, 737-738 (1953)
Fred Sanger (1975, 1977)
"Plus and minus" DNA sequencing, J. Mol. Biol. 94 (3): 441–8 (1975)
Dideoxy chain termination DNA sequencing followed in 1977
Kary Mullis (1983)
Polymerase Chain Reaction
(PCR)
Human Genome Project formally began in October 1990, funded by the DoE and NIH, USA
Completion of the first assembly of the human genome: June 26, 2000 in Washington
Bill Clinton with Craig Venter and Francis Collins
13 years, $300 million
1995: Haemophilus influenzae (bacterium), 1.8 Mb, ~1,700 genes [Science 269: 496]
1997: Budding yeast (eukaryote), 13 Mb, ~6K genes [Nature 387: 1]
1998: C. elegans (worm, animal), ~100 Mb, ~20K genes [Science 282: 1945]
2000: Human, ~3 Gb, ~100K genes (estimate at the time; later revised to ~20–25K protein-coding genes)
Genomes highlight the finiteness of the “data” in Biology
“Next Gen” and 3rd Gen Sequencing
454 Life Sciences
Pyrosequencing (2005)
Illumina (Solexa) (2006)
Latest HiSeq: up to 200 Gb per run, 2 x 100 bp read length, up to 25 Gb per day
In a single run, sequence two human genomes at ~30x coverage for less than $10,000 (USD) per genome
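The "two genomes at ~30x in one run" claim can be checked with simple arithmetic: 2 genomes × ~3 Gb × 30 = 180 Gb, which fits within the quoted 200 Gb per run. The sketch below just encodes that calculation; the variable and function names are assumptions for this example, and the genome size is the approximate ~3 Gb figure from the slides.

```python
GENOME_SIZE_GB = 3.0     # approximate human genome size (~3 Gb)
RUN_YIELD_GB = 200.0     # HiSeq yield per run quoted above
TARGET_COVERAGE = 30     # ~30x coverage quoted above
READ_PAIR_BP = 2 * 100   # one 2 x 100 bp paired-end read

def bases_needed(genome_gb: float, coverage: int, n_genomes: int = 1) -> float:
    """Total sequenced bases (in Gb) needed for the requested coverage."""
    return genome_gb * coverage * n_genomes

needed = bases_needed(GENOME_SIZE_GB, TARGET_COVERAGE, n_genomes=2)
fits_in_one_run = needed <= RUN_YIELD_GB
read_pairs = needed * 1e9 / READ_PAIR_BP
```

The same numbers imply roughly 900 million read pairs per run, which is why storage and analysis quickly become the bottleneck.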
ABi (Life Technologies) (2007)
SOLiD (Sequencing by Oligonucleotide Ligation and Detection)
60 gigabases of usable DNA data per run
3rd Gen DNA Sequencing (New!)
Single Molecule PCR-independent Sequencing
- HeliScope Single Molecule Sequencer (2009)
- Pacific Biosciences SMRT™ (2009)
100 Gb/hr, >1,000 bases per read (2010)
5 days: Haitian Cholera epidemic genome completed
Jonas Korlach & Steve Turner
Impact of 3rd Gen Sequencing
- RNA-seq for transcriptomics
- BS-seq (bisulfite seq) for DNA methylation
- ChIP-seq for DNA-protein interactions
- CNV-seq for copy number variations
Newer 3rd Gen DNA Sequencing (New!)
- Nanopore sequencing (Oxford Nanopore and UCSC)
- towards electronic, single-molecule sequencing of DNA strands
- http://www.nanoporetech.com/press_releases/detail/114
- Ion Torrent
- Quantum Dot sequencing (future?)
Bioinformatics is crucial
for Personal Genomics
Another Wave
Of Exponential Growth?
The $1,000 Human Genome?
Exponential Growth
23andMe.com
deCODEme
Navigenics
Knome
Hellogenome (Theragen)
Complete Genomics (USA)
Beijing Genomics Institute (BGI)
etc.
Bioinformatics in genomics
“the development and application of techniques from computer science, mathematics and statistics to address biological problems”
Solving the data deluge in
- Personal genomics
- Cancer genomics
- 1000 Genomes Project
- How to store the data?
- How to analyse the data?
- How to extract information and knowledge from the data?
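On the storage question, one commonly used idea is that a four-letter alphabet needs only 2 bits per base, so ~3 Gb of sequence occupies about 750 MB instead of ~3 GB as plain text. The sketch below illustrates 2-bit packing under simplifying assumptions (only A, C, G, T; no ambiguity codes); the function names and the test sequence are invented for this example.

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack four bases into each byte, first base in the highest bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Reverse pack(); 'length' trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

seq = "GATTACAGATTACA"        # 14 bases, invented for the demo
packed = pack(seq)
restored = unpack(packed, len(seq))
```

The 14-base string fits in 4 bytes and round-trips exactly. Real formats also have to store quality scores and read names, which usually dominate file size.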
From genome (1D) to structure (3D)
Crystal to Structure Pipeline
Pipeline stages shown: Crystal Management, Crystal Mounting, Crystal Alignment, Crystal Description, Data Collection, Structure Determination (ASAP)
- Automation of individual process steps
- Systems integration of automated procedures
- Knowledge-based approach to structure determination
- Crystal description based on diffraction pattern & micro-beam exposures
- Structure Determination based on ASAP; processing engines developed by collaborators
© John Wooley, UCSD 2003
The Industrial Scale Discovery PipeLine of JCSG
HT Pipeline Processes, Bottlenecks and Leaks
Pipeline stages shown (figure labels): target selection, cloning, expression, purification, crystallization, imaging, harvesting, bl xtal mounting, xtal screening, data collection, phasing, tracing, struc. refinement, struc. validation, annotation, publication, PDB
© John Wooley, UCSD 2003
Growth of Protein Databank PDB
Structures
Common cold: structure of the protein shell, or capsid,
of the human rhinovirus. Credit: J. Y. Sgro, UW-Madison
Molecules → complexes, pathways
Regulating biological processes
Jak-STAT Signaling Pathway. KEGG Pathways.
http://www.genome.jp/kegg/pathway/hsa/hsa04630.html
(Accessed: Aug 5, 2011)
Multiple pathways in a cell…
Cells have multiple processes that must be
coordinated
Metabolism Pathways. KEGG Pathways.
http://www.genome.jp/kegg/pathway/map/map01100.html (Accessed: Aug 5, 2011)
Parts of a machine
Image: http://www.edwardsheattreating.com/images/machine%20parts.jpg
Image: http://www.sperdvac.org/Horizontal%20Mill/milling%20machine.jpg
And so, we study the individual parts of the machine
in order to understand the machine itself.
Flagellar
system
http://www.fbs.osaka-u.ac.jp/en/seminar/image/09_img13.jpg
© Protonic NanoMachine Project,
ERATO, Japan (Namba)
http://www.fbs.osaka-u.ac.jp/en/seminar/image/09_img12.jpg
Interactomics
Which proteins (biomolecules) interact with which proteins (biomolecules)?
- Pathway information

Stanyon et al. Genome Biology 2004 5:R96
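The question "which proteins interact with which?" maps naturally onto a graph. The sketch below is a minimal illustration using a dictionary of sets; the protein names P1 to P4 and their interactions are placeholders invented for this example, not data from the cited paper.

```python
from collections import defaultdict

# Hypothetical pairwise interactions (placeholder names, not real data).
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]

def build_network(pairs):
    """Build an undirected interaction network as {protein: set(partners)}."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return network

network = build_network(interactions)
partners_of_p3 = sorted(network["P3"])
# The "hub" is the protein with the most interaction partners.
hub = max(network, key=lambda protein: len(network[protein]))
```

In this toy network P3 is the hub with three partners; in real interactome maps such highly connected nodes are often the first candidates for follow-up.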
Cellular “Circuitry”
Representing and simulating the well-known genetic switch mechanism of λ (lambda) phage to choose between lysis and lysogeny growth pathways, e.g. with the hybrid functional Petri net technique
http://www.genomicobject.net/member3/GONET/lambda.htm
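To give a feel for the Petri net idea, here is a plain discrete Petri net, which is much simpler than the hybrid functional Petri net mentioned above. It is a sketch only: the place names, the single "induce" transition and the token counts are assumptions chosen to caricature a switch from a cI-dominated (lysogeny) state to a cro-dominated (lysis) state.

```python
# Places hold token counts; a transition consumes and produces tokens.
# Names and numbers are hypothetical, chosen only to mimic a switch.
marking = {"cI": 1, "cro": 0, "signal": 1}

transitions = {
    # An inducing signal flips the cI-dominated state to a cro-dominated one.
    "induce": {"consume": {"cI": 1, "signal": 1}, "produce": {"cro": 1}},
}

def enabled(marking, transition):
    """A transition is enabled if every input place has enough tokens."""
    return all(marking[p] >= n for p, n in transition["consume"].items())

def fire(marking, transition):
    """Return the new marking after firing; unchanged if not enabled."""
    if not enabled(marking, transition):
        return marking
    new_marking = dict(marking)
    for place, n in transition["consume"].items():
        new_marking[place] -= n
    for place, n in transition["produce"].items():
        new_marking[place] += n
    return new_marking

after = fire(marking, transitions["induce"])
again = fire(after, transitions["induce"])  # no tokens left, so nothing fires
```

Once the signal token is consumed the transition can no longer fire, so the net stays in its new state. The hybrid functional variant adds continuous quantities and rate functions on top of this token-passing core.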
E-cell
- Model builder
- Algorithm modules
- Simulation visualiser
© E-cell.org
Tissue-Organ Modeling –
Simulating heart myocardial function
The Physiome
Project
© Peter Hunter http://ep.physoc.org/cgi/content/full/89/1/1
BioImaging Technologies
With Genomics, Proteomics
Computational Biology
© Nature Cell Biology 2003
Integrative Biology
© Nature Cell Biology 2003
Novartis Institute for Tropical Diseases
(NITD) STOPDengue Project
Integrative Biology
helping Industry
© Nature Cell Biology 2003
Bioinformatics and Computational Biology underpin Integrative Biology to handle the complexity of biological data
(Figure panels labelled 1D, 2D, 3D, 6D)
Spectrum of Bioinformatics
and Computational Biology
Figure labels:
- Axes: COMPUTATIONAL, INFORMATIONAL, EXPERIMENTAL, OBSERVATIONAL
- Research modes: In silico, In vitro, In vivo
- Levels: Sequence, Structure, Function, Interactions, Pathways, Regulatory Networks/Circuits, “E-Cell” simulation, Tissue/Organ Physiology
- Data sources: Genomics, Proteomics, Transcriptomics, Metabolomics, other ‘omics, BioImaging
Biology is Big Science these days
After the genome projects, industrial-scale generation of data is no big deal.
Sophisticated bioinstrumentation, from automated sequencers to microarray systems, runs 24/7 and churns out ever larger volumes of data at ever higher throughput.
- Stanford BioX (US$150M)
- MIT CSBi (US$10M/yr)
- Princeton Sigler Inst for Integrative Genomics, ICAHN Lab (US$40M)
- Duke Institute for Genome Sciences and Craig Venter’s TCAG (US$250M)
- UMichigan LSI (US$380M)
- QB3 UCalifornia SF/Santa Cruz/Berkeley (US$200M)
- Cornell LSI (US$140M)
- UCSD JCSG
© GenomeWeb LLC 2003
Singapore’s Mechanobiology Institute
S$150M
Bioinformatics and Computational Biology
underpin the research process.
The Biological Data Deluge
of Volume and Complexity
Critical Need for
Bioinformatics
and Computational Biology
expertise in
the next generation
of Biologists
20th C: Century of Physics and the Atom Bomb
21st C: Century of Biology and Biotechnology
The Economist
Feb/Mar 2010
What does all of this have to do with you???
Basic IT and Computer Literacy in Biologists?
Basics of Bioinformatics in Brief
Bioinformatics applications are made up of many combinations of:
- Data
- Database
- Algorithm
- Visualisation
Biological Databases
- Collect, organize and classify data
- Query the dataset
- Retrieve entries based on keyword search
- Limitations of databases
Image from Entrez Gene.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene (Accessed Aug 5, 2011)
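Keyword retrieval, and one of its limitations, can be sketched with a toy in-memory "database". Everything here is hypothetical: the record identifiers, gene symbols and descriptions are placeholders, and the function is not how Entrez itself works.

```python
# Hypothetical records; fields and values are placeholders, not real entries.
records = [
    {"id": "G001", "symbol": "geneA", "description": "putative protein kinase"},
    {"id": "G002", "symbol": "geneB", "description": "DNA polymerase subunit"},
    {"id": "G003", "symbol": "geneC", "description": "kinase inhibitor"},
]

def keyword_search(db, keyword):
    """Return the ids of records whose symbol or description contains the keyword."""
    kw = keyword.lower()
    return [
        record["id"]
        for record in db
        if kw in record["description"].lower() or kw in record["symbol"].lower()
    ]

hits = keyword_search(records, "Kinase")
misses = keyword_search(records, "phosphotransferase")
```

The search for "Kinase" returns G001 and G003, including an inhibitor that is not itself a kinase, while the synonym "phosphotransferase" returns nothing. Plain keyword matching depends entirely on how entries were annotated.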
Sequence Analysis and other bioinformatics software
- Why bioinformatics software?
- Examples of selected software
- Scope and sources of bioinformatics software
- Cautions re: bioinformatics software
- Accessing and using bioinformatics software
Sequence Comparison, Alignment, Assembly
- After collecting a set of related sequences, how can we compare them as a set?
- How should we line up the sequences so that the most similar portions are together?
- What do we do with sequences of different lengths?
- How can we compare a given sequence to the millions in the database?
- Which ones are truly related by evolution?
- What can the study of related sequences tell us?
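The questions about lining up sequences of different lengths are what pairwise alignment algorithms address. Below is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach) with a simple linear gap penalty. The scoring values (+1 match, -1 mismatch, -1 gap) and the example sequences are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
```

For these two toy sequences the best score is 5: six matches and one gap inserted into the shorter sequence. Searching "the millions in the database" relies on faster heuristics, because this exact method costs time proportional to the product of the two lengths.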
Patterns and Motifs
- What is the signature found in groups of sequences?
- How can we use these signature patterns or motifs for rapid identification of familial relationships?
- Can patterns be used to assign function?
Image: http://www-lecb.ncifcrf.gov/~toms/sequencelogo.html
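A signature pattern can be scanned for with a regular expression. The sketch below uses the well-known PROSITE N-glycosylation pattern N-{P}-[ST]-{P}, rewritten as a regex. The peptide being scanned is invented for the demo, and a match only flags a candidate site; it does not prove function.

```python
import re

# PROSITE pattern N-{P}-[ST]-{P} as a regex: N, then any residue except P,
# then S or T, then any residue except P. The lookahead allows overlapping hits.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_motif(protein: str):
    """Return (1-based position, matched text) for each motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYC.finditer(protein)]

# Hypothetical peptide invented for the demo.
hits = find_motif("MKNGSAANPTLNLTQ")
```

The scan reports candidate sites at positions 3 (NGSA) and 12 (NLTQ); the N at position 8 is skipped because it is followed by a proline.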
Evolution and Phylogenetic Analysis
- What is a phylogenetic tree?
- Algorithms used to generate a phylogenetic tree
- Decisions governing choice of phylogenetic programs
- Interpreting genome structure using phylogeny
Image: David Begun
http://www.newsandevents.utoronto.ca/bios/askus4.htm
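As one example of a tree-building algorithm, the sketch below implements UPGMA, a simple distance-based clustering method, on pre-aligned sequences and returns the topology in Newick format. It is illustrative only: the four sequences A to D are invented, distances are plain p-distances, and branch lengths are omitted. It is not necessarily the method taught in the workshop.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions that differ (equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs: dict) -> str:
    """Cluster aligned sequences with UPGMA; return the topology as Newick."""
    # Each cluster is keyed by its Newick label and stores its leaf count.
    sizes = {name: 1 for name in seqs}
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }
    while len(sizes) > 1:
        # Join the two closest clusters.
        closest = min(dist, key=dist.get)
        a, b = sorted(closest)
        merged = f"({a},{b})"
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            if other in closest:
                continue
            d = (
                dist[frozenset((a, other))] * sizes[a]
                + dist[frozenset((b, other))] * sizes[b]
            ) / (sizes[a] + sizes[b])
            dist[frozenset((merged, other))] = d
        # Drop every distance that involves the two old clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"

# Hypothetical aligned sequences invented for the demo.
seqs = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACTTAT",
    "D": "TCGAAGTTGT",
}
tree = upgma(seqs)
```

For this toy alignment the result is `(((A,B),C),D);`: A and B differ at one site and join first, then C, then D. UPGMA assumes a constant rate of change along all lineages, which is one of the "decisions governing choice of phylogenetic programs".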
Structure Visualization
- Using graphic tools to view structures
- Simple commands to analyse structures and active sites
- Different graphic representations and colouring schemes
- The function of a protein is a consequence of its folded state: Anfinsen, 1961
- The 3D fold of a protein is called its structure
- In 3D, the business end of the protein has contributions from different regions of its sequence
Image: Eric Martz
RasMol Gallery.
http://www.umass.edu/microbio/rasmol/galmz.htm
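The point that residues far apart in the sequence can sit together in 3D can be illustrated by computing distances between C-alpha atoms. This is a sketch under stated assumptions: the coordinates, residue numbers, the 8 Å cutoff and the minimum sequence separation of 20 are all invented for the demo rather than taken from a real PDB entry.

```python
import math

# Hypothetical C-alpha coordinates (in angstroms) keyed by residue number.
ca_coords = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    57: (0.0, 5.0, 0.0),
    120: (30.0, 0.0, 0.0),
}

def long_range_contacts(coords, cutoff=8.0, min_separation=20):
    """Pairs of residues far apart in sequence but close together in space."""
    contacts = []
    residues = sorted(coords)
    for index, i in enumerate(residues):
        for j in residues[index + 1:]:
            far_in_sequence = j - i >= min_separation
            close_in_space = math.dist(coords[i], coords[j]) <= cutoff
            if far_in_sequence and close_in_space:
                contacts.append((i, j))
    return contacts

contacts = long_range_contacts(ca_coords)
```

Residues 10 and 11 both end up in contact with residue 57, almost fifty positions away in the sequence, while residue 120 is too distant in space. Viewing a structure in a graphics tool makes such contacts visible at a glance.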
Course objectives of the workshop
Students will be sufficiently familiar with a set of common bioinformatics resources, and their underlying concepts, such that:
- You will be able to apply these resources appropriately to solve biological questions.
- You will be able to independently identify, use and assess additional resources as they become available.
- You will be prepared to pursue more advanced bioinformatics training.
- You will begin to think computationally and informatically in doing biology.
Bioinformatics and
Computational Thinking in Solving
Problems in Biology
Database
Algorithms
Analysis
Visualization
Biology
Software Applications
READINGS:
- Jonathan Pevsner (2009) Bioinformatics and Functional Genomics (2nd edition). Wiley-Blackwell. Chapters 1 and 2; relevant database sections in Chapters 19 and 20. See http://bioinfbook.org/
- Arthur Lesk (2008) Introduction to Bioinformatics (3rd ed). Oxford University Press. Chapters 3 and 4. (Optional)
- For Practicals, if you are still lost: Jean-Michel Claverie and Cedric Notredame (2007) Bioinformatics for Dummies (2nd Edition). Wiley Publishing. (Optional)