Last lecture summary

Sequencing strategies
• Hierarchical genome shotgun (HGS) – Human Genome Project
• “map first, sequence second”
• clone-by-clone … cloning is performed twice (BAC, plasmid)
Sequencing strategies
• Whole genome shotgun (WGS) – Celera
• shotgun, no mapping
• Coverage - the average number of reads representing a given
nucleotide in the reconstructed sequence. HGS: 8, WGS: 20
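The coverage figures above can be related to read counts with a short calculation. This is a minimal sketch assuming the common approximation coverage = (number of reads × read length) / genome size. The function names are hypothetical, and the 600 bp read length and 3 billion bp genome size are borrowed from other slides in this lecture.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each nucleotide (approximation)."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target coverage."""
    total_bases = target_coverage * genome_size
    # Integer ceiling division avoids floating point rounding.
    return (total_bases + read_length - 1) // read_length


GENOME_SIZE = 3_000_000_000  # human genome, about 3 billion bp
READ_LENGTH = 600            # Sanger read length quoted in this lecture

print(coverage(40_000_000, READ_LENGTH, GENOME_SIZE))  # 8.0 (HGS-like coverage)
print(reads_needed(20, READ_LENGTH, GENOME_SIZE))      # 100000000 (WGS-like coverage)
```

Under these assumptions, 8-fold coverage needs about 40 million reads and 20-fold coverage about 100 million reads.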
Human genome
• 3 billion bp, ~20 000 – 25 000 genes
• Only 1.1 – 1.4 % of the genome sequence codes for proteins.
• State of completion:
• best estimate – 92.3% is complete
• problematic unfinished regions: centromeres, telomeres (both contain
highly repetitive sequences), some unclosed gaps
• It is likely that the centromeres and telomeres will remain unsequenced
until new technology is developed
• Genome is stored in databases
• Primary database – GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
• Additional data and annotation, tools for visualizing and searching
• UCSC (http://genome.ucsc.edu)
• Ensembl (http://www.ensembl.org)
New stuff
Personal human genomes
• Personal genomes had not been sequenced in the
Human Genome Project to protect the identity of
volunteers who provided DNA samples.
• The following personal genomes were available by July 2011:
• Japanese male (2010, PMID: 20972442)
• Korean male (2009, PMID: 19470904)
• Chinese male (2008, PMID: 18987735)
• Nigerian male (2008, PMID: 18987734)
• J. D. Watson (2008, PMID: 18421352)
• J. C. Venter (2007, PMID: 17803354)
• The HGP sequence is haploid, whereas the sequence maps
of Venter and Watson are diploid.
Next generation sequencing (NGS)
• The completion of the human genome was just the start of the
modern DNA sequencing era – “high-throughput next
generation sequencing” (NGS).
• New approaches reduce time and cost.
• Holy Grail of sequencing – a complete human genome
for less than $1000.
1st and 2nd generation of sequencers
• 1st generation – ABI Prism 3700 (Sanger, fluorescence, 96
capillaries), used in HGP and in Celera
• The Sanger method still outperforms NGS in read length (600 bp)
• 2nd generation – birth of HT-NGS in 2005. 454 Life
Sciences developed the GS 20 sequencer, which combines PCR
with pyrosequencing.
• Pyrosequencing – sequencing-by-synthesis
• Relies on detection of pyrophosphate release on nucleotide
incorporation rather than chain termination with ddNTPs.
• The release of pyrophosphate is detected by a flash of light
(chemiluminescence).
• Average read length: 400 bp
• Roche GS-FLX 454 (successor of GS 20) used for J.
Watson’s genome sequencing.
3rd generation
• The 2nd generation still uses PCR amplification, which may
introduce base sequence errors or favor certain
sequences over others.
• To overcome this, the emerging 3rd generation of
sequencers performs single-molecule sequencing
(i.e. the sequence is determined directly from one DNA
molecule, with no amplification or cloning).
• Compared to the 2nd generation, these instruments offer
higher throughput, longer reads (~1000 bp), higher
accuracy, a smaller amount of starting material, and lower cost
Moore’s law
[Figure; source: http://www.genome.gov/27541954]
Cost per genome
[Figure; labelled value: $1,363; source: http://www.genome.gov/27541954]
Cost per megabase
[Figure; labelled values: $5,000 and 1.5 cents]
Illumina HiSeq X Ten
• 14 January 2014: Illumina announced
the new HiSeq X Ten
Sequencing System.
• Illumina claims they are enabling
the $1,000 genome.
• Uses Illumina SBS technology
(sequencing-by-synthesis).
• It sells for at least $10 million.
Human Longevity
• 4 March 2014 – Human Longevity was founded by Craig Venter.
• Its main aim: to slow down the process of ageing.
• The largest human DNA sequencing operation in the
world, capable of processing 40,000 human genomes a
year.
• DNA data will be combined with other data on the health
and body composition of the people whose DNA is
sequenced, in the hope of gleaning insights into the
molecular causes of aging and age-related illnesses like
cancer and heart disease.
• Equipment: 2x Illumina HiSeq X Ten
Which genomes were sequenced?
• http://www.ncbi.nlm.nih.gov/sites/genome
• GOLD – Genomes online database
(http://www.genomesonline.org/)
• information regarding complete and ongoing genome projects
Important genomics projects
• The analysis of personal genomes has demonstrated
how difficult it is to draw medically or biologically relevant
conclusions from individual sequences.
• More genomes need to be sequenced to learn how genotype
correlates with phenotype.
• 1000 Genomes project (http://www.1000genomes.org/), 2009-2012.
Sequence the genomes of at least 1000 people from around the
world to create a detailed and medically useful picture of human
genetic variation. The 2nd generation of sequencers is used in 1000
Genomes.
• 10 000 Genomes (UK10K), 2010-2013.
• 100 000 Genomes, started 2012, should be finished in 2017.
Sequence Alignment
What is a sequence alignment?
CTTTTCAAGGCTTA
        GGCTTATTATTGC
Fragment overlaps
CTTTTCAAGGCTTA
        GGCTATTATTGC
CTTTTCAAGGCTTA
        GGCT-ATTATTGC
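The fragment overlap shown above can be found by comparing suffixes of one fragment with prefixes of the other. This is a minimal sketch, not a real assembler: the function names `overlap` and `merge` are hypothetical, and only exact overlaps are detected.

```python
def overlap(left, right, min_length=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left, right):
    """Join two fragments, writing the shared overlap only once."""
    return left + right[overlap(left, right):]


print(overlap("CTTTTCAAGGCTTA", "GGCTTATTATTGC"))  # 6, the shared "GGCTTA"
print(merge("CTTTTCAAGGCTTA", "GGCTTATTATTGC"))    # CTTTTCAAGGCTTATTATTGC
print(overlap("CTTTTCAAGGCTTA", "GGCTATTATTGC"))   # 0, one missing base breaks the exact overlap
```

The last call illustrates why exact matching is not enough: with a single missing base the fragments overlap only if a gap is allowed, as in the third pair above.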
What is a sequence alignment?
CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG
   CCCCATGGTGGCGGCAGGTGACAG
      CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
           TGGCGGCTCGGGACAGTCGCGCATAAT
     CCATGGTGGTGGCTGGGGATAGTA
                   TGAGGCAGTCGCGCATAATTCCG
TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG
consensus
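One way to obtain a consensus from aligned reads is a per-column majority vote. The sketch below assumes each read has already been placed at a start offset. The offsets are an assumption: they were inferred by lining the six reads up against the consensus on this slide and are not given in the lecture. The function name `consensus` is hypothetical.

```python
from collections import Counter

# (start offset, read) pairs; offsets are inferred, not taken from the slide.
PLACED_READS = [
    (3, "CCCCATGGTGGCGGCAGGTGACAG"),
    (6, "CATGGGGGAGGATGGGGACAGTCCGG"),
    (0, "TTACCCCATGGTGGCGGCTTGGGAAACTT"),
    (11, "TGGCGGCTCGGGACAGTCGCGCATAAT"),
    (5, "CCATGGTGGTGGCTGGGGATAGTA"),
    (19, "TGAGGCAGTCGCGCATAATTCCG"),
]


def consensus(placed_reads):
    """Most frequent base in every column of the placed reads."""
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    # Assumes every column is covered by at least one read.
    return "".join(col.most_common(1)[0][0] for col in columns)


print(consensus(PLACED_READS))
# TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG
```

With these offsets the vote reproduces the consensus line above, even though every read contains at least one base that disagrees with it.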
Why align sequences
• The draft human genome is available
• Automated gene finding is possible
• Gene:
AGTACGTATCGTATAGCGTAA
• What does it do?
• One approach: Is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match
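The "best match" idea can be illustrated with a very crude score: slide one sequence along the other without gaps and count identical positions. This is only a sketch under stated assumptions. The function names and the entries `geneA` and `geneB` are hypothetical toy data invented for illustration, and real database searches use far more sophisticated scoring.

```python
def best_ungapped_score(a, b):
    """Largest number of identical positions over all gapless placements of b on a."""
    best = 0
    for shift in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, base in enumerate(b)
            if 0 <= i + shift < len(a) and a[i + shift] == base
        )
        best = max(best, matches)
    return best


def best_match(query, known_genes):
    """Name of the known gene with the highest ungapped score against the query."""
    return max(known_genes, key=lambda name: best_ungapped_score(query, known_genes[name]))


QUERY = "AGTACGTATCGTATAGCGTAA"  # the gene from the slide

# Hypothetical toy "database", not real genes.
KNOWN_GENES = {
    "geneA": "AGTACGTATCGTTTAGCGTAA",  # the query with one substitution
    "geneB": "CCCCGGGGTTTTAAAACCCCG",
}

print(best_match(QUERY, KNOWN_GENES))                      # geneA
print(best_ungapped_score(QUERY, KNOWN_GENES["geneA"]))    # 20
```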
Sequence alignment
• Procedure of comparing sequences
• Point mutations – easy
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
gapless alignment
• More difficult example
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
• However, gaps can be inserted to get something like this
(a gap stands for an insertion or a deletion, i.e. an indel)
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
gapped alignment
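The two alignments above can be compared by counting matches, mismatches and gap positions column by column. A minimal sketch, with the hypothetical function name `tally_alignment`:

```python
def tally_alignment(top, bottom):
    """Count matches, mismatches and gap columns of two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# Gapless alignment (point mutations only)
print(tally_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "ACGTCTGATTCGCCCTATCGTCTATCT"))
# {'match': 24, 'mismatch': 3, 'gap': 0}

# Gapped alignment
print(tally_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))
# {'match': 18, 'mismatch': 2, 'gap': 7}
```

Counts like these are the raw material for an alignment score once each match, mismatch and gap is given a weight.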
Sequence alphabet
Amino acids, grouped by side chain (charge at physiological pH 7.4)

Group                            Name            3 letters  1 letter
Positively charged side chains   Arginine        Arg        R
                                 Histidine       His        H
                                 Lysine          Lys        K
Negatively charged side chains   Aspartic Acid   Asp        D
                                 Glutamic Acid   Glu        E
Polar uncharged side chains      Serine          Ser        S
                                 Threonine       Thr        T
                                 Asparagine      Asn        N
                                 Glutamine       Gln        Q
Special                          Cysteine        Cys        C
                                 Selenocysteine  Sec        U
                                 Glycine         Gly        G
                                 Proline         Pro        P
Hydrophobic side chains          Alanine         Ala        A
                                 Leucine         Leu        L
                                 Isoleucine      Ile        I
                                 Methionine      Met        M
                                 Phenylalanine   Phe        F
                                 Tryptophan      Trp        W
                                 Tyrosine        Tyr        Y
                                 Valine          Val        V
Adenine   A
Thymine   T
Cytosine  C
Guanine   G
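The two alphabets can be written down as simple lookup structures. The sketch below uses the 21 amino acid codes and 4 nucleotide codes listed on this slide. The names `THREE_TO_ONE`, `guess_alphabet` and `to_one_letter` are hypothetical, and the alphabet guess is only a heuristic, because A, C, G and T are also valid amino acid letters.

```python
# Three-letter to one-letter codes for the amino acids on this slide.
THREE_TO_ONE = {
    "Arg": "R", "His": "H", "Lys": "K", "Asp": "D", "Glu": "E",
    "Ser": "S", "Thr": "T", "Asn": "N", "Gln": "Q", "Cys": "C",
    "Sec": "U", "Gly": "G", "Pro": "P", "Ala": "A", "Leu": "L",
    "Ile": "I", "Met": "M", "Phe": "F", "Trp": "W", "Tyr": "Y",
    "Val": "V",
}

DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set(THREE_TO_ONE.values())


def guess_alphabet(seq):
    """Heuristic: 'DNA' if only A, C, G, T occur, else 'protein' or 'unknown'."""
    letters = set(seq.upper())
    if letters <= DNA_ALPHABET:
        return "DNA"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


def to_one_letter(residues):
    """Convert a list of three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


print(to_one_letter(["Met", "Ala", "Sec"]))  # MAU
print(guess_alphabet("ACGTTGCA"))            # DNA
print(guess_alphabet("MKVLAW"))              # protein
```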
Flavors of sequence alignment
pair-wise alignment × multiple sequence alignment
Flavors of sequence alignment
global alignment × local alignment
global – align the entire sequence
local – stretches of sequence with the highest density of matches are aligned, generating islands of matches or subalignments in the aligned sequences
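The difference between the two flavors can be sketched with a small dynamic programming routine. This is an illustration only: the lecture does not name an algorithm at this point, so the Needleman-Wunsch style (global) and Smith-Waterman style (local) recurrences used here are an assumption, as are the scoring values (match +1, mismatch -1, gap -1) and the function name `alignment_score`.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b; global by default, local if requested."""
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
            cur.append(score)
            best = max(best, score)
        prev = cur
    # Global: score of aligning both sequences end to end.
    # Local: best score of any pair of subsequences.
    return best if local else prev[-1]


print(alignment_score("ACGT", "AGT"))                         # 2 (three matches, one gap)
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))    # 3 (the shared island "ACG")
```

In the second call the two sequences share only the short island "ACG"; the local score rewards that island, while a global alignment of the same pair is penalized for everything around it.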
Evolution
[Figure: common ancestors; source: wikipedia.org]
Evolution of sequences
• The sequences are the products of molecular evolution.
• When sequences share a common ancestor, they tend to
exhibit similarity in their sequences, structures and
biological functions.
[Diagram labels: DNA1, DNA2, Protein1, Protein2; sequence similarity; similar 3D structure; similar function]
Similar sequences produce similar proteins
However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5) PMID: 11178260