Sequencing genomes

Last lecture summary
• recombinant DNA technology
• DNA polymerase (copy DNA), restriction endonucleases (cut DNA),
ligases (join DNA)
• DNA cloning – vector (plasmid, BAC), PCR
• genome mapping
• genetic linkage map: relative locations of genes are
established by following inheritance patterns
• cytogenetic map: visual appearance of a chromosome when
stained and examined under a microscope
• physical map: the order and spacing of the genes, measured
in base pairs
• sequence map
• genetic markers
• polymorphic (alternative alleles)
• restriction fragment length polymorphisms (RFLPs)
• some restriction sites exist as two alleles
• simple sequence length polymorphisms (SSLPs)
• repeat sequences, minisatellites (repeat unit up to 25 bp),
microsatellites (repeat unit of 2-4 bp)
• single nucleotide polymorphisms (SNPs, pron.: “snips”)
• Positions in a genome where some individuals have one nucleotide and
others have a different nucleotide
RFLP
SSLP
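The RFLP idea above can be sketched in a few lines: if a point mutation destroys a restriction site in one allele, complete digestion gives a different set of fragment lengths. The two alleles below are made-up sequences, and the helper name `digest` is hypothetical. EcoRI (site GAATTC, cut after the G) is used only as a familiar example enzyme.

```python
# EcoRI recognition site; the enzyme cuts between G and AATTC.
SITE = "GAATTC"
CUT_OFFSET = 1


def digest(seq, site=SITE, cut_offset=CUT_OFFSET):
    """Return fragment lengths after complete digestion of a linear molecule."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Two hypothetical alleles of the same locus. Allele B carries one
# substitution (GAATTC -> GAGTTC) that destroys the first site.
allele_a = "ATGC" * 3 + "GAATTC" + "TTAGGC" * 2 + "GAATTC" + "CCGTA" * 2
allele_b = allele_a.replace("GAATTC", "GAGTTC", 1)

print(digest(allele_a))  # three fragments
print(digest(allele_b))  # two fragments: the first two are fused
```

On a gel the two alleles would show different band patterns, which is what makes the site usable as a marker.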
New stuff
DNA sequencing
• Sanger method (chain-termination method),
developed 1977, Nobel Prize in Chemistry 1980
• The key principle: use of dideoxynucleotide triphosphates
(ddNTPs) as DNA chain terminators.
dNTP
ddNTP
source: http://openwetware.org/wiki/BE.109:Bio-material_engineering/Sequence_analysis
source: wikipedia
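A toy simulation may help to see why chain termination reveals the sequence. It is a sketch under idealized assumptions: it works directly on the newly synthesized strand (in reality complementary to the template), and it assumes that every possible terminated fragment is present and resolved on the gel. The function names are hypothetical.

```python
def chain_termination_fragments(new_strand):
    """For each of the four ddNTP reactions, list terminated fragment lengths."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        # A ddNTP of this base can be incorporated here and stop synthesis,
        # giving a fragment of exactly this length in that base's lane.
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from the shortest to the longest fragment."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = chain_termination_fragments("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Each fragment length occurs in exactly one lane, so sorting the fragments by size and noting the lane gives the base order.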
Shotgun sequencing
Target DNA
Copies of target DNA
Shotgun (restriction endonuclease)
Sequence each short piece
(read, ~1kbp)
Sequence assembly
contig
Consensus
Finalizing (directed read)
source: slides by Martin Farach-Colton
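The assembly step can be illustrated with a greedy overlap merge: repeatedly join the two reads with the longest suffix-prefix overlap. This is a minimal sketch, not the assembler used in any real project. It assumes error-free reads from one strand, ignores reads contained inside other reads, and the names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads into contigs, always taking the longest overlap first."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

With repeats longer than the overlap, the greedy choice can join the wrong reads, which is the problem shown on the next slide.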
Problems with repeats in the assembly
source: Brown T. A. , Genomes. 2nd ed. http://www.ncbi.nlm.nih.gov/books/NBK21129/
Human genome project (HGP)
• Determine the sequence of haploid human
genome
• Governmentally funded (DOE)
• Began in 1990, working draft published
in 2001, complete in 2003, last chromosome
finished in 2006
• Cost: $3 billion
• Whose genome was sequenced?
• The “reference genome” is a composite from several people who
donated blood samples.
Celera - competition begins
• In 1998, a similar privately
funded quest was launched
by the American researcher
Craig Venter and his company
Celera Genomics.
• Finish the genome sequencing
within 3 years for $300,000,000.
• Celera wanted to patent identified genes.
• Celera promised to release data annually (while the HGP
daily). However, Celera would, unlike HGP, not permit free
redistribution or scientific use of the data.
• For this reason, HGP was compelled to release the first draft of
the human genome (7 July 2000) before Celera.
How did it finish?
• March 2000 – President Clinton announced that the
genome sequence could not be patented and should be
made freely available to all researchers.
• The statement sent Celera's stock plummeting and
dragged down the biotechnology-heavy Nasdaq. The
biotechnology sector lost about $50 billion in two days.
• Celera and HGP jointly announced the draft sequence in
2000.
• The drafts covered about 83% of the genome.
• Improved drafts were announced in 2003 and 2005, filling
in to ≈92% of the sequence currently.
Human genome – some facts
• 3 billion bp, ~20 000 – 25 000 genes
• Only 1.1 – 1.4 % of the genome sequence codes for proteins.
• State of completion:
• best estimate – 92.3% is complete
• problematic unfinished regions: centromeres, telomeres (both contain
highly repetitive sequences), some unclosed gaps
• It is likely that the centromeres and telomeres will remain unsequenced
until new technology is developed
Databases
• Genome is stored in databases
• Primary database (NCBI) – GenBank
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
• Additional data and annotation, tools for visualizing and searching
• UCSC (http://genome.ucsc.edu) … University of California – Santa Cruz
• Ensembl (http://www.ensembl.org) … EBI+Sanger
• Chromosome
• largest #1 = 250 Mbp, smallest #21 = 48 Mbp
• http://www.ensembl.org/Homo_sapiens/Location/Genome
Hierarchical genome shotgun – HGS
• Hierarchical genome shotgun, hierarchical shotgun
sequencing, clone-by-clone sequencing, map-based
shotgun sequencing, clone contig sequencing
• Adopted by HGP
• Strategy “map first, sequence second”
• Create physical map
• Divide chromosomes into smaller fragments.
• Order (map) them to correspond to their respective
locations on the chromosomes.
• Determine the base sequence of each of the mapped
fragments.
Multiplied genomic
DNA
restriction endonuclease
BAC fragments
160 kbp
Minimum tiling
path (MTP)
clone in BAC
BAC to be
sequenced
restriction endonuclease
Shotgun clones
1 kbp
clone in plasmid
Sequencing and
assembly
http://www.nature.com/scitable/content/idealized-representation-of-the-hierarchical-shotgun-sequencing-48221
Minimum tiling path (MTP)
• MTP – the lowest possible number of BACs to cover the
sequence.
• MTP BACs are selected for sequencing.
http://en.wikipedia.org/wiki/Shotgun_sequencing
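Once every BAC has a position on the physical map, choosing an MTP resembles the classic interval-cover problem. The sketch below uses the standard greedy rule: among clones that start inside the region already covered, take the one that reaches furthest. The clone names and coordinates (in kbp) are invented, and the function name is hypothetical. Real MTP selection also requires a minimum overlap between neighbouring clones, which is ignored here.

```python
def minimum_tiling_path(clones, region_end):
    """Pick the fewest clones covering [0, region_end).

    clones: list of (name, start, end) in map coordinates, end exclusive.
    """
    path, covered = [], 0
    remaining = sorted(clones, key=lambda c: c[1])
    i = 0
    while covered < region_end:
        best = None
        # Scan all clones that start within the covered region.
        while i < len(remaining) and remaining[i][1] <= covered:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BAC library, coordinates in kbp.
bacs = [
    ("A", 0, 160), ("B", 40, 190), ("C", 150, 310),
    ("D", 180, 330), ("E", 300, 470), ("F", 320, 480),
]
print(minimum_tiling_path(bacs, 480))
```

Only the clones on the returned path would go on to shotgun sequencing.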
Hierarchical genome shotgun – HGS
1. Map genome
• As genetic markers (landmarks), sequence-tagged sites (STS) were used
(200 to 500 base pair DNA sequence that has a single occurrence in
the genome)
2. Copy target DNA
3. Make BAC library
• cleave randomly (partial cleavage by restriction endonuclease) all
target DNA copies into ~160kbp fragments, clone them in BACs
4. Physically map all BACs
5. Identify a minimum tiling path (MTP) BACs
6. Shotgun sequence only BACs at MTP
• Divide BACs into ~1kbp fragments, do plasmid cloning, reconstruct
BAC sequence
7. Fill in gaps between BACs
8. Merge into consensus sequence
Coverage
• As was shown, individual nucleotides are read more
than once.
• Coverage is the average number of reads representing a
given nucleotide in the reconstructed sequence.
• Let’s say that for a source strand of length G = 100 kbp
we sequence R = 1 500 reads of average length L = 500 bp.
• Thus, we collect N = RL = 750 kbp of data.
• So we have sequenced on average every bp in the
source N/G = 7.5 times.
• The coverage is 7.5X.
• Coverage in HGS adopted by HGP was 8X.
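The arithmetic above as a short calculation, using the slide's numbers (G = 100 kbp, R = 1 500, L = 500 bp). The last function is an addition not taken from the slides: under the idealized assumption that reads start uniformly at random, the chance that a given base is missed by all reads is roughly e^(-coverage) (a Poisson approximation). The function names are hypothetical.

```python
import math


def coverage(genome_length, n_reads, read_length):
    """Average number of reads covering each base: N/G = R*L/G."""
    return n_reads * read_length / genome_length


def reads_needed(genome_length, read_length, target_coverage):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


def expected_uncovered_fraction(cov):
    """Assumption: uniformly random reads, Poisson approximation."""
    return math.exp(-cov)


c = coverage(100_000, 1_500, 500)
print(f"coverage: {c}X")
print("reads for 8X:", reads_needed(100_000, 500, 8))
print(f"expected uncovered fraction at {c}X: {expected_uncovered_fraction(c):.6f}")
```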
Whole genome shotgun – WGS
• Adopted by Celera
• Expensive and time-consuming mapping is skipped
• High coverage (20x) needed
• New algorithms for assembly
• Repeats are problematic (HGP data used by Celera)
http://en.wikipedia.org/wiki/Shotgun_sequencing
Genome assembly
• Can be very computationally intensive when dealt with at the
whole-genome level.
• Major challenges:
• sequence errors – can be corrected by drawing consensus
sequence from an alignment of multiple overlapped sequences
• contamination by bacterial vectors – can be removed using filtering
programs prior to assembly
• repeats
• Popular programs developed within HGP and still used:
PHRED and PHRAP
PHRED
• Base caller – converts raw data from a sequencing
instrument into sequences and scores; each score reflects the
likelihood that the base call is correct/incorrect
ideal case
real case
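The quality score relates to the estimated error probability P of a base call through Q = -10 · log10(P), so Q = 20 means a 1 in 100 chance of a wrong base and Q = 30 means 1 in 1 000. A small conversion sketch (the function names are hypothetical):

```python
import math


def error_prob_to_phred(p_error):
    """Q = -10 * log10(P), with P the probability that the call is wrong."""
    return -10 * math.log10(p_error)


def phred_to_error_prob(q):
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```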
PHRAP
• Sequence assembler
• Takes PHRED base-call files with quality scores as input.
• Aligns individual fragments in a pairwise fashion. The
base quality information is taken into account during the
pairwise alignment.
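To illustrate how base qualities can be used once reads are aligned, the sketch below calls a consensus by summing the quality scores that support each base in a column. This is only an illustration of quality-aware consensus, not PHRAP's actual algorithm. The reads, qualities and the function name are made up, and '-' marks positions a read does not cover.

```python
from collections import defaultdict


def weighted_consensus(aligned_reads):
    """aligned_reads: list of (sequence, qualities), all of equal length."""
    length = len(aligned_reads[0][0])
    consensus = []
    for col in range(length):
        support = defaultdict(int)
        for seq, quals in aligned_reads:
            if seq[col] != "-":
                support[seq[col]] += quals[col]
        # Base with the highest total quality wins; "N" if nothing covers it.
        consensus.append(max(support, key=support.get) if support else "N")
    return "".join(consensus)


# The second read has a low-quality miscall (A instead of T) in column 3.
aligned = [
    ("ACGTAC--", [30, 30, 30, 30, 30, 30, 0, 0]),
    ("-CGAACGT", [0, 30, 30, 10, 30, 30, 30, 30]),
    ("--GTACGT", [0, 0, 40, 40, 40, 40, 40, 40]),
]
print(weighted_consensus(aligned))
```

The low-quality miscall is outvoted by the two high-quality reads that cover the same position.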