Genome Evolution II

CS173
Lecture 11: Repeats II, Mutations
MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu
http://cs173.stanford.edu [BejeranoWinter12/13]
Announcements
• TA HW1 Comments
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA
TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA
ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT
ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTT
TGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGT
TCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATAC
ATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT
GCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA
CGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGA
ATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACA
TCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAAC
GGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAA
CTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTG
GCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTC
TTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAAT
TGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT
GCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT
AATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCT
TCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT
AATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGA
TTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTA
CTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTT
TACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTT
ACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAA
AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGT
Transcription
Transcription Regulation
Chromatin / Proteins
Extracellular signals
DNA / Proteins
Repeats
Sequences that repeat many times in the genome
• Cumulatively take up a whopping half of the genome
• Come in two major, very different, flavors
I
II
I. Interspersed Repeats
Get a copy out of the genome, and into a new location.
II. Simple Repeats
• Every possible motif of mono-, di-, tri- and tetranucleotide repeats is
vastly overrepresented in the human genome, e.g. AAAAAAAAA, CACACACAC, CAACAACAA.
• These are called microsatellites. Longer repeating units are called
minisatellites, and the really long ones are called satellites.
• Highly polymorphic in the human population.
• Highly heterozygous in a single individual.
• As a result, microsatellites are used in paternity testing, forensics, and
the inference of demographic processes.
• There is no clear definition of how many repetitions make a simple
repeat, nor of how imperfect the different copies can be.
• Highly variable between species: e.g., using the same search criteria,
the mouse & rat genomes have 2-3 times more microsatellites than
the human genome. They’re also longer in mouse & rat.
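A minimal sketch of how perfect simple repeats could be located in a sequence. Because there is no agreed definition of a simple repeat, the thresholds `max_unit` and `min_copies` and the function name are illustrative assumptions, and imperfect copies are ignored entirely.

```python
import re

def find_microsatellites(seq, max_unit=4, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    max_unit and min_copies are arbitrary illustrative thresholds.
    """
    # The lazy group captures the shortest repeating unit (1..max_unit bases);
    # the backreference then demands at least min_copies - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# The three example motifs from the slide.
for example in ("AAAAAAAAA", "CACACACAC", "CAACAACAA"):
    print(example, find_microsatellites(example))
```

Changing either threshold changes what counts as a repeat, which is one reason counts are only comparable between species when the same search criteria are used.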
DNA Replication
Simple Repeats Create Funky DNA structures
These Bumps Give The DNA Polymerase Hiccups
Expandable Repeats and Disease
Restriction Enzymes
• Restriction enzymes recognize and make a cut within
specific DNA sequences, known as restriction sites.
• This is usually a 4-6 base pair palindromic sequence.
• Naturally found in different types of bacteria
• Bacteria use restriction enzymes to protect themselves
from foreign DNA
• Many have been isolated and sold for use in lab work
blunt end
sticky end
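In this context "palindromic" means that a site reads the same as its own reverse complement, not that the string is a literal palindrome. A minimal sketch of that check follows; the function names are illustrative, and GAATTC (the EcoRI recognition site) is used as the example.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic_site(site):
    """True if the site equals its own reverse complement."""
    return site == reverse_complement(site)

print(is_palindromic_site("GAATTC"))   # EcoRI recognition site
print(is_palindromic_site("GATTACA"))  # an arbitrary non-palindromic string
```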
DNA Fingerprint Basics
DNA fragments of different size will be produced
by a restriction enzyme that cuts at the points
shown by the arrows.
DNA fragments are then separated based on
size using gel electrophoresis.
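The two steps above (cutting at restriction sites, then separating fragments by size) can be sketched as follows. The helper names, the toy sequences and the `cut_offset` parameter (how many bases into the site the cut falls) are assumptions for illustration only.

```python
def digest(seq, site, cut_offset):
    """Cut seq at every occurrence of site and return the fragments in order."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

def fragment_sizes(fragments):
    """Fragment lengths, largest first (small fragments migrate farthest on a gel)."""
    return sorted((len(f) for f in fragments), reverse=True)

# Two toy individuals: the second carries a point change that destroys one site.
person_a = "AAGAATTCTTTTGAATTCA"
person_b = "AAGAATTCTTTTGAGTTCA"
print(fragment_sizes(digest(person_a, "GAATTC", 1)))
print(fragment_sizes(digest(person_b, "GAATTC", 1)))
```

Because the two individuals differ at a restriction site, the same enzyme yields different fragment size patterns, which is what a fingerprint compares.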
DNA Fingerprinting can be used in
paternity testing or murder cases.
There are Tracks for it
Interspersed vs. Simple Repeats
From an evolutionary point of view transposons and simple
repeats are very different.
Different instances of the same transposon share common
ancestry (but not necessarily a direct common progenitor).
Different instances of the same simple repeat most often
do not.
Genome Content, Genome Function DONE
• Transcripts
• Protein coding genes
• Non-coding RNAs
• Gene regulatory elements
• Promoters
• Enhancers
• Repressors
• Insulators
• Epigenomics
• Nucleosomes, open chromatin
• Histone modifications
• Repeats
• Interspersed repeats / mobile elements
• Simple repeats
Categories are NOT mutually exclusive
• We already discussed repeat instances that became
• Coding exons
• Enhancers
• There are known genomic loci that
• Code for protein coding exons and act as enhancers.
• Ditto for non-coding RNA + enhancer.
• There are bi-directional exons
• Coding in both directions
• Coding and anti-sense
• Both non-coding
Comparative Genomics
“Nothing in Biology Makes Sense
Except in the Light of Evolution”
Theodosius Dobzhansky
[Figure: species tree along a time axis (t), with leaves human, chimp, macaque, mouse, rat, cow, dog, opossum, platypus, chicken, zfish, tetra, fugu]
The genome is constantly replicated
Every cell holds 2 copies of all its DNA = its genome.
The human body is made of ~10^13 cells.
All originate from a single cell through repeated cell divisions.
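As a rough worked example, the number of rounds of division needed to go from one cell to about 10^13 cells can be estimated. This sketch assumes the intended figure is 10^13 cells, that every cell divides in every round, and that no cell dies; real development satisfies none of these exactly, so the result is only a lower bound on the depth of the division tree.

```python
import math

def min_division_rounds(target_cells):
    """Smallest number of synchronized doublings that reaches target_cells from one cell."""
    return math.ceil(math.log2(target_cells))

rounds = min_division_rounds(10**13)
print(rounds)                    # rounds of doubling
print(2**(rounds - 1), 2**rounds)  # cell counts just below and above the target
```

Even under this idealization the genome is copied a few dozen times along every lineage from egg to adult cell, and each copy is a chance for a mistake.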
[Figure: an egg cell, whose genome (all its DNA) is organized into DNA strings (chromosomes), becomes a chicken through cell division; chicken (DNA) ≈ 10^13 copies of egg (DNA)]
Evolution = Mutation + Selection
Mistakes can happen during DNA replication. Mistakes are
oblivious to DNA segment function. But then selection kicks in.
[Figure: the segment ...ACGTACGACTGACTAGCATCGACTACGA... is copied from chicken to egg to chicken, picking up changes (TT, CAT) along the way. In junk DNA “anything goes”; in functional DNA many changes are not tolerated.]
This has bad implications – disease,
and good implications – adaptation.
Mutation
Chromosomal (i.e. big)
Mutations
• Five types exist:
– Deletion
– Inversion
– Duplication
– Translocation
– Nondisjunction
Deletion
• Due to breakage
• A piece of a
chromosome is lost
Inversion
• Chromosome segment
breaks off
• Segment flips around
backwards
• Segment reattaches
Duplication
• Occurs when a
genomic region is
repeated
Whole Genome Duplication at the Base of the Vertebrate Tree
Xen.Laevis WGD
Translocation
• Involves two
chromosomes that
aren’t homologous
• Part of one
chromosome is
transferred to
another chromosome
Nondisjunction
• Failure of chromosomes to separate
during meiosis
• Causes gamete to have too many or
too few chromosomes
• Disorders:
– Down Syndrome – three copies of chromosome 21
– Turner Syndrome – single X chromosome
– Klinefelter’s Syndrome – XXY chromosomes
Genomic (i.e. small)
Mutations
• Six types exist:
– Substitution (e.g. G→T)
– Deletion
– Insertion
– Inversion
– Duplication
– Translocation
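The six small-scale mutation types listed above can be illustrated as operations on a string. This is only a toy sketch: the function names and half-open index conventions are assumptions, and inversion is modelled as a reverse complement of the segment, which is what flipping a double-stranded piece of DNA amounts to on a single strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def substitution(seq, i, base):
    """Replace the base at position i."""
    return seq[:i] + base + seq[i + 1:]

def deletion(seq, start, end):
    """Remove seq[start:end]."""
    return seq[:start] + seq[end:]

def insertion(seq, i, new):
    """Insert new bases before position i."""
    return seq[:i] + new + seq[i:]

def inversion(seq, start, end):
    """Replace seq[start:end] by its reverse complement."""
    flipped = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + flipped + seq[end:]

def duplication(seq, start, end):
    """Insert a second copy of seq[start:end] right after the first (tandem)."""
    return seq[:end] + seq[start:end] + seq[end:]

def translocation(seq, start, end, dest):
    """Move seq[start:end] to position dest of the remaining sequence."""
    rest = seq[:start] + seq[end:]
    return rest[:dest] + seq[start:end] + rest[dest:]

print(substitution("ACGTAC", 2, "T"))
print(inversion("ACGTAC", 1, 4))
```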
Number of events
Example: Human-Chimp Genomic Differences
Inferring Genomic Mutations
From Alignments of Genomes
A Gene tree evolves with respect to
a Species tree
By “Gene” we mean
any piece of DNA.
[Figure: a gene tree drawn inside a species tree; legend: speciation, duplication, loss]
Terminology
Orthologs: Genes related via speciation (e.g. C,M,H3)
Paralogs: Genes related through duplication (e.g. H1,H2,H3)
Homologs: Genes that share a common origin
(e.g. C,M,H1,H2,H3)
[Figure: gene tree, rooted at a single ancestral gene, drawn inside the species tree; legend: speciation, duplication, loss]
Typical Molecular Distances
If they were only evolving neutrally:
• To which is H1 closer in sequence, H2 or H3?
• To which H is M closest?
• And C?
(Selection may change distances)
[Figure: gene tree, rooted at a single ancestral gene, drawn inside the species tree; legend: speciation, duplication, loss]
Gene trees and even species trees are
figments of our (scientific) imagination
Species trees and gene trees can be wrong.
All we really have are extant observations, and fossils.
[Figure: gene tree, rooted at a single ancestral gene, drawn inside the species tree, with parts marked as inferred or observed; legend: speciation, duplication, loss]
Gene Families
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings
x = x1x2...xM,
y = y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, M in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap
in the other sequence.
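The definition can be turned into a small check: two gapped rows form an alignment of x and y if they have equal length, removing the gaps recovers x and y, and no column pairs a gap with a gap. The sketch below applies this to the example pair, with the two aligned rows written at equal length (an assumption about the original slide layout); the function names are illustrative.

```python
def is_alignment(row_x, row_y, x, y, gap="-"):
    """Check that the gapped rows are a valid alignment of x and y."""
    same_length = len(row_x) == len(row_y)
    recovers = row_x.replace(gap, "") == x and row_y.replace(gap, "") == y
    no_gap_vs_gap = all(not (a == gap and b == gap) for a, b in zip(row_x, row_y))
    return same_length and recovers and no_gap_vs_gap

def column_counts(row_x, row_y, gap="-"):
    """Count (matches, mismatches, gap columns) in an alignment."""
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps

x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_alignment(row_x, row_y, x, y))
print(column_counts(row_x, row_y))
```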
Scoring Function
Alternative definition: minimal edit distance
“Given two strings x, y, find the minimum # of edits (insertions, deletions,
mutations) to transform one string to the other”
• Sequence edits, starting from AGGCCTC:
– Mutations: AGGACTC
– Insertions: AGGGCCTC
– Deletions: AGG.CTC
Cost of edit operations needs to be biologically inspired (e.g. DEL length).
Scoring Function:
Match: +m
Mismatch: -s
Gap: -d
Solve via Dynamic Programming
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
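A minimal sketch of the dynamic program for the best global alignment score under this scoring function, in the style of Needleman-Wunsch with a linear gap cost. The default values of m, s and d are arbitrary choices for illustration, not values from the lecture, and only the score (not the alignment itself) is returned.

```python
def global_alignment_score(x, y, m=1, s=1, d=2):
    """Best F = matches*m - mismatches*s - gaps*d over all global alignments of x and y."""
    # prev[j] holds the best score for aligning the current prefix of x with y[:j].
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]  # x[:i] aligned against an empty prefix of y: i gaps
        for j in range(1, len(y) + 1):
            diagonal = prev[j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            gap_in_y = prev[j] - d      # x[i-1] aligned to a gap
            gap_in_x = cur[j - 1] - d   # y[j-1] aligned to a gap
            cur.append(max(diagonal, gap_in_y, gap_in_x))
        prev = cur
    return prev[-1]

# The three single-edit examples from the slide, each scored against AGGCCTC.
for edited in ("AGGACTC", "AGGGCCTC", "AGGCTC"):
    print(edited, global_alignment_score("AGGCCTC", edited))
```

Keeping only two rows of the matrix is enough for the score; recovering the alignment itself would need the full matrix or traceback pointers.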
Are two sequences homologous?
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Given an (optimal) alignment between two genome regions,
you can ask: what is the probability that they are (not) related
by homology?
Note that (when known) the answer is a function of the
molecular distance between the two (e.g., between two species).
DP matrix:
Chaining Alignments
Chaining highlights homologous regions between genomes (it bridges
the gulf between syntenic blocks and base-by-base alignments).
Local alignments tend to break at transposon insertions, inversions,
duplications, etc.
Global alignments tend to force non-homologous bases to align.
Chaining is a rigorous way of joining together local alignments into
larger structures.
DP matrix:
dot plots:
“Raw” (B)lastz track (no longer displayed)
Alignment = homologous regions
Protease Regulatory Subunit 3
Chains & Nets: How they’re built
• 1: Blastz one genome to another
– Local alignment algorithm
– Finds short blocks of similarity
Hg18: AAAAAACCCCCAAAAA
Mm8: AAAAAAGGGGG
Alignment blocks:
Hg18.1-6 + AAAAAA / Mm8.1-6 + AAAAAA
Hg18.7-11 + CCCCC / Mm8.1-5 - CCCCC
Hg18.12-16 + AAAAA / Mm8.1-5 + AAAAA
Chains & Nets: How they’re built
• 2: “Chain” alignment blocks together
– Links blocks that preserve order and orientation
– Not single coverage in either species
Hg18: AAAAAACCCCCAAAAA
Mm8: AAAAAAGGGGGAAAAA
[Figure: Mm8 chains drawn under Hg18 AAAAAACCCCCAAAAA, with block labels Mm8.1-6 +, Mm8.12-16 +, Mm8.7-11, Mm8.12-15 +, Mm8.1-5 +]
Another Chain Example
Ancestral Sequence: A B C D E
Human Sequence: A B C D E
Mouse Sequence: A B C B’ D E
[Figure: In the Human Browser, mouse chains are drawn against the implicit human sequence; in the Mouse Browser, human chains are drawn against the implicit mouse sequence. Surviving block labels: B’, D, E.]
The Use of an Outgroup
Outgroup Sequence: A B C D E
Mouse Sequence: A B C B’ D E
Human Sequence: A B C D E
[Figure: In the Human Browser, mouse chains are drawn against the implicit human sequence; in the Mouse Browser, human chains are drawn against the implicit mouse sequence. Surviving block labels: B’, D, E.]
Chains join together related local alignments
likely ortholog
likely paralogs
shared domain?
Protease Regulatory Subunit 3
Chains
• a chain is a sequence of gapless aligned blocks, where there must be
no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically nondecreasing (i.e. always increasing or flat).
• double-sided gaps are a new capability (blastz can't do that) that
allow extremely long chains to be constructed.
• not just orthologs, but paralogs too, can result in good chains, but
that's useful!
• chains should be symmetrical: e.g. swap human-mouse chains to get
mouse-human chains, and you should get approximately the same chains
as if you had chained the swapped mouse-human blastz alignments.
• chained blastz alignments are not single-coverage in either target or
query unless some subsequent filtering (like netting) is done.
• chain tracks can contain massive pileups when a piece of the target
aligns well to many places in the query. Common causes of this
include insufficient masking of repeats and high-copy-number genes
(or paralogs).
[Angie Hinrichs, UCSC wiki]
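The first two properties in the list above (gapless blocks, no overlaps, nondecreasing coordinates in both target and query) can be sketched as a validity check. The block representation used here, half-open tuples `(t_start, t_end, q_start, q_end)`, is an assumption for illustration and is not the UCSC chain file format.

```python
def is_valid_chain(blocks):
    """Check that blocks form a chain: gapless blocks, ordered and non-overlapping."""
    # A gapless block covers the same number of bases in target and query.
    for t_start, t_end, q_start, q_end in blocks:
        if t_end <= t_start or t_end - t_start != q_end - q_start:
            return False
    # Each block must start at or after the end of the previous one, in both
    # coordinates. Gaps on both sides at once (double-sided gaps) are allowed.
    for (_, t_end, _, q_end), (t_next, _, q_next, _) in zip(blocks, blocks[1:]):
        if t_next < t_end or q_next < q_end:
            return False
    return True

print(is_valid_chain([(0, 6, 0, 6), (11, 16, 11, 16)]))  # double-sided gap
print(is_valid_chain([(0, 6, 0, 6), (4, 9, 11, 16)]))    # target overlap
```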
Before and After Chaining
Chaining Algorithm
Input - blocks of gapless alignments from blastz
Dynamic program based on the recurrence relationship:
score(B_i) = max_{j<i} ( score(B_j) + match(B_i) - gap(B_i, B_j) )
Uses Miller’s KD-tree algorithm to limit which parts of the dynamic
programming graph are traversed. Running time is O(N log N), where N is
the number of blocks (which is in the hundreds of thousands).
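A minimal sketch of the recurrence, deliberately using a plain O(N^2) scan over predecessors instead of the KD-tree speedup, so it only illustrates the dynamic program and not the real running time. The block tuples `(t_start, t_end, q_start, q_end, match_score)`, the linear `gap_cost` and the option for a block to start a new chain on its own are assumptions, not details taken from the actual chaining code.

```python
def gap_cost(prev, cur, per_base=1):
    """Hypothetical linear penalty on the gap lengths in target and query."""
    return per_base * ((cur[0] - prev[1]) + (cur[2] - prev[3]))

def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain of gapless blocks."""
    if not blocks:
        return 0, []
    blocks = sorted(blocks)  # order by target start
    score, back = [], []
    for i, cur in enumerate(blocks):
        best, link = cur[4], None  # base case: the block starts its own chain
        for j in range(i):
            prev = blocks[j]
            # prev may precede cur only if it ends before cur starts in both genomes
            if prev[1] <= cur[0] and prev[3] <= cur[2]:
                candidate = score[j] + cur[4] - gap_cost(prev, cur)
                if candidate > best:
                    best, link = candidate, j
        score.append(best)
        back.append(link)
    # Trace back from the best-scoring block to recover the chain.
    i = max(range(len(blocks)), key=score.__getitem__)
    top = score[i]
    chain = []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    return top, chain[::-1]

blocks = [(0, 6, 0, 6, 60), (6, 11, 3, 8, 50), (11, 16, 11, 16, 50)]
print(best_chain(blocks))
```

In this toy input the middle block overlaps the first one in the query, so it cannot follow it, and the best chain links the first and last blocks across a double-sided gap.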