Stochastic Models for Sequence
Pattern Discovery
Mayetri Gupta
Department of Statistics
Harvard University
Richard Bayes (1596-1675), a great-grandfather of Thomas Bayes, was a
successful cutler in Sheffield. In 1643 Richard Bayes served in the rotating
position of Master of the Company of Cutlers of Hallamshire. Richard was
sufficiently well off that he sent one of his sons, Samuel Bayes (1635-1681) to
Trinity College Cambridge during the Commonwealth period; Samuel obtained
his degree in 1656. Another son, Joshua Bayes (1638-1703) followed in his
father’s footsteps in the cutlery industry, also serving as Master of the
Company in 1679. Evidence of Joshua Bayes’s wealth comes from the size of
his house, the fact that he employed a servant and the size of the taxes that he
paid. His influence may be taken from his activities in the town government.
Following the 1662 Act of Uniformity, Samuel Bayes was ejected from his
parish, eventually living in Manchester (Matthews, 1934). Joshua Bayes was
closely involved in the erection of one Nonconformist chapel in Sheffield and
had two sons-in-law involved in another Sheffield chapel. The second son of
Joshua Bayes (1638-1703) was another Joshua Bayes (1671-1746). In 1686 the
younger Joshua Bayes entered a dissenting academy where he studied
philosophy and divinity. Joshua Bayes and his wife Anne née Carpenter were
married some time, probably within days, after their marriage license was
issued on October 23, 1700. Joshua and Anne Bayes had seven children and
Thomas was the eldest.
Ref: Thomas Bayes: A Biography to Celebrate the Tercentenary of His Birth, by Dr. D. R. Bellhouse
after removing punctuation....
richardbayesagreatgrandfatherofthomasbayeswasasuccessfulcutlerin
sheffieldinrichardbayesservedintherotatingpositionofmasterofthecompa
nyofcutlersofhallamshirerichardwassufficientlywelloffthathesentoneofhi
ssonssamuelbayestotrinitycollegecambridgeduringthecommonwealthp
eriodsamuelobtainedhisdegreeinanothersonjoshuabayesfollowedinhisf
athersfootstepsinthecutleryindustryalsoservingasmasterofthecompanyi
nevidenceofjoshuabayesswealthcomesfromthesizeofhishousethefactth
atheemployedaservantandthesizeofthetaxesthathepaidhisinfluencema
ybetakenfromhisactivitiesinthetowngovernmentfollowingtheactofunifor
mitysamuelbayeswasejectedfromhisparisheventuallylivinginmancheste
rmatthewsjoshuabayeswascloselyinvolvedintheerectionofonenonconfo
rmistchapelinsheffieldandhadtwosonsinlawinvolvedinanothersheffieldc
hapelthesecondsonofjoshuabayeswasanotherjoshuabayesintheyounge
rjoshuabayesenteredadissentingacademywherehestudiedphilosophya
nddivinityjoshuabayesandhiswifeanneneecarpenterweremarriedsometi
meprobablywithindaysaftertheirmarriagelicensewasissuedonoctober
joshuaandannebayeshadsevenchildrenandthomaswastheeldest
with an alphabet of 4 letters....
abcdefg → a
hijklmn → c
opqrstu → g
vwxyz → t
with an alphabet of 4 letters....
gcacagaaatagaagaagagacaaagcaggagcgcagaatagtagaggaaaggagcaggcagcc
gcaaacacaccgcacagaaataggagtaaccgcagggagccagggcgcgcgacaggaggagcaagcga
ctgaaggcagggacaccacgccgagcacagatagggaacacacgcttaccgaagcagcagacggcagacc
gggcggacgacaatagggggcccgtagccaaaaacagcaaaaggccagcaagccgctaacgcg
agcgagacgacgagaccaaccgaaagaaccacggcagggccggcgaaatagagccgtaaccccga
agcaggagggggaggccgcaaggcagtccaggggtacgggagtccaagcaggaggagcaagcgactc
catcaacaagacggcgaaataggtaacgcagcagaggcgcagctagaccgcgggagcaaaaggc
agcaacgcgtaaagagtacgacagcagctagagcagataggcagcagacaccgccacgacaaca
taagacacaggcccgaagctcgcagccgcaggtcagtagccacgagccgtccagcaaaggagccagg
ccgtgacgacaatagtagacaagaaaggcccggagcgcatacggacctcctccacccacacagga
gcaggcatgcggcgaaatagtagacggactcctgctaaccgcaagaagcgcgagcacgcagcag
gccggacagacccgcaaacacaacacaagtgggcgcccatcctgctaaccacggcaggcaaacacaa
cagacgcagaagcaggcgacggcgaaatagtagacggcagcggcgaaatagccgcatggcaa
gcggcgaaatagacgagaaaacggacgccaaaaaacttcagacagggacaagcccggggcta
caactcccgtcggcgaaatagacaccgtcaaaccacaaaaggacgagtagacaggcaaggcagc
cagggaaacttcgcccaatgaagaggcacgcaggcaaaccaacgatagcgggaagcgaggaag
cggcgaacaaccaaatagcaagatacacccagacacagcgcagtaggcaacaagg
joshua → cggcga
bayes → aatag
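The reduction to a 4-letter alphabet can be sketched in a few lines of Python. This is only an illustration of the letter grouping shown on the slide; the function names `strip_text` and `encode` are hypothetical and not part of the original talk.

```python
import string

# Letter groups from the slide: a-g -> a, h-n -> c, o-u -> g, v-z -> t.
GROUPS = {"abcdefg": "a", "hijklmn": "c", "opqrstu": "g", "vwxyz": "t"}
TABLE = {ch: code for group, code in GROUPS.items() for ch in group}


def strip_text(text: str) -> str:
    """Lower-case the text and keep only the letters a-z."""
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)


def encode(text: str) -> str:
    """Map every remaining letter to its 4-letter-alphabet code."""
    return "".join(TABLE[ch] for ch in strip_text(text))


if __name__ == "__main__":
    print(encode("joshua"), encode("bayes"))
    print(encode("Richard Bayes (1596-1675)"))
```

Frequent words of the English text become frequent "motifs" of the encoded string, which is the analogy the talk builds on.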
Also marked in the encoded sequence: “rdwas”, “taxes”, “sewas” (each encodes to gatag)
Upstream sequences in E. coli
taatgtttgtgctggtttttgtggcatcgggcgagaatagcgcgtggtgtgaaagactgttttt
ttgatcgttttcacaaaaatggaagtccacagtcttgacaggacaaaaacgcgtaacaaaagtg
tctataatcacggcagaaaagtccacattgattatttgcacggcgtcacactttgctatgccat
agcatttttatccataagacaaatcccaataacttaattattgggatttgttatatataacttt
ataaattcctaaaattacacaaagttaataactgtgagcatggtcatatttttatcaatcacaa
agcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatgtatgca
aaggacgtcacattaccgtgcagtacagttgatagcacggtgctacacttgtatgtagcgcatc
tttctttacggtcaatcagcatggtgttaaattgatcacgttttagaccattttttcgtcgtga
aactaaaaaaaccagtgaattatttgaaccagatcgcattacagtgatgcaaacttgtaagtag
atttccttaattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaatagcgcataaaa
aacggctaaattcttgtgtaaacgattccactaatttattccatgtcacacttttcgcatcttt
gttatgctatggttatttcataccataagccgctccggcggggttttttgttatctgcaattca
gtacaaaacgtgatcaacccctcaattttccctttgctgaaaaattttccattgtctcccctgt
aaagctgtaacgcaattaatgtgagttagctcactcattaggcaccccaggctttacactttat
gcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacattaccgccaatt
ctgtaacagagatcacacaaagcgacggtggggcgtaggggcaaggaggatggaaagaggttgc
cgtataaagaaactagagtccgtttaggaggaggcgggaggatgagaacacggcttctgtgaac
taaaccgaggtcatgtaaggaatttcgtgatgttgcttgcaaaaatcgtggcgattttatgtgc
gcagatcagcgtcgttttaggtgagttgttaataaagatttggaattgtgacacagtgcaaatt
cagacacataaaaaaacgtcatcgcttgcattagaaaggtttctgctgacaaaaaagattaaac
ataccttatacaagacttttttttcatatgcctgacggagttcacacttgtaagttttcaacta
cgttgtagactttacatcgccttttttaaacattaaaattcttacgtaatttataatctttaaa
aaaagcatttaatattgctccccgaacgattgtgattcgattcacatttaaacaatttcagacc
catgagagtgaaattgttgtgatgtggttaacccaattagaattcgggattgacatgtctta
Biological Motivation
Transcription regulation: DNA strand
DNA ..ATTGATAATCCCTAGTCGATATTAAATCGGCTAACTTAAAGCGC..
Alphabet of 4 letters: A, C, G, T
Transcription factor (TF) recognizes binding site (motif)
TF attracts RNA polymerase to promoter
Transcription: mRNA copied from DNA
mRNA → proteins
[Figure: DNA strand with TRANSCRIPTION FACTOR and RNA POLYMERASE bound; TRANSCRIPTION yields mRNA, then PROTEINS]
Transcription Regulation
Discovery of transcription factor binding sites
Motif Representation
Position-specific weight matrix Θ for |TATAAT|:
Position   1     2     3     4     5     6
A          0.0   1.0   0.0   0.9   0.8   0.0
C          0.0   0.0   0.0   0.0   0.15  0.0
G          0.0   0.0   0.4   0.0   0.05  0.0
T          1.0   0.0   0.6   0.1   0.0   1.0
Aligned sites:
TAGAAT TATACT TATTAT TAGAAT TATAAT
TATAAT TAGACT TATAAT TATACT TATAAT
TATAAT TATAAT TAGAGT TAGAAT TATAAT
TAGAAT TAGTAT TATAAT TATAAT TAGAAT
[Figure: sequence logo, positions 1-6; letter height gives the “information content” (Kullback-Leibler) in bits]
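As a small worked example, the weight matrix and the per-position information content can be computed directly from the 20 aligned sites listed on the slide. The sketch below assumes a uniform background (0.25 per letter), which the slide does not state; the function names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
SITES = [
    "TAGAAT", "TATACT", "TATTAT", "TAGAAT", "TATAAT",
    "TATAAT", "TAGACT", "TATAAT", "TATACT", "TATAAT",
    "TATAAT", "TATAAT", "TAGAGT", "TAGAAT", "TATAAT",
    "TAGAAT", "TAGTAT", "TATAAT", "TATAAT", "TAGAAT",
]


def weight_matrix(sites):
    """Column-wise relative frequencies of A, C, G, T."""
    columns = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        columns.append({b: counts[b] / len(sites) for b in ALPHABET})
    return columns


def information_content(column, background=0.25):
    """Kullback-Leibler divergence (bits) of one column from the
    background; the uniform background is an assumption."""
    return sum(p * math.log2(p / background) for p in column.values() if p > 0)


if __name__ == "__main__":
    theta = weight_matrix(SITES)
    for pos, col in enumerate(theta, start=1):
        print(pos, col, round(information_content(col), 3))
```

A fully conserved position (such as position 1, always T) reaches the maximum of 2 bits; mixed positions score lower.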
Computational Approaches to Motif Discovery
Pattern Search Methods
Word frequency based (enumerate, check significance): Consensus, Dictionary, MDscan
Weight matrix based (initialize, refine using data): EM (MEME), Gibbs sampling (Motif Sampler, AlignAce, BioProspector)
A Word Frequency Based Approach
Consensus (Stormo and Hartzell, 1989)
Width = 5
Sequences: tgctaatct, ctaattagc, ttagcagtt
Idea: maximize entropy distance between motif and background
[Figure: 4 × 5 count matrices (rows A, C, G, T), one for each width-5 window of the 1st sequence]
Find all matrices from 1st sequence
Update each matrix by “best” match in 2nd sequence, keep all matrices that score above a cut-off level. Repeat for rest of the sequences
[Figure: the count matrices after adding the best-matching window from the 2nd sequence (entries 0, 1, 2)]
Initial sequences must contain motif
Repeat, randomizing sequence order?
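The greedy procedure described above can be sketched as follows. This is not the Consensus program itself: the score used here (summed information content against a uniform background) and the fixed number of matrices kept per round (`keep`, in place of a score cut-off) are simplifying assumptions.

```python
import math

ALPHABET = "acgt"


def windows(seq, w):
    """All width-w substrings of seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def info_score(words):
    """Sum over columns of the KL divergence (bits) from a uniform
    background; higher means a sharper alignment."""
    n = len(words)
    score = 0.0
    for col in zip(*words):
        for b in ALPHABET:
            p = col.count(b) / n
            if p > 0:
                score += p * math.log2(p / 0.25)
    return score


def consensus(seqs, w, keep=3):
    # Start one alignment from every window of the first sequence.
    alignments = [[word] for word in windows(seqs[0], w)]
    for seq in seqs[1:]:
        # Extend each alignment by its best-scoring window in this sequence.
        extended = [
            max((al + [word] for word in windows(seq, w)), key=info_score)
            for al in alignments
        ]
        # Keep only the top alignments (assumed stand-in for the cut-off).
        extended.sort(key=info_score, reverse=True)
        alignments = extended[:keep]
    return alignments


if __name__ == "__main__":
    for al in consensus(["tgctaatct", "ctaattagc", "ttagcagtt"], w=5):
        print(al, round(info_score(al), 3))
```

Because every alignment is seeded from the first sequence, a motif absent from that sequence can never be found, which is why the slide notes that the initial sequences must contain the motif and suggests repeating with a randomized sequence order.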
Weight Matrix-based Approaches
Motif width w assumed known.
EM for finite mixture models (Lawrence and Reilly, 1990)
1 motif per sequence
MEME (Bailey and Elkan, 1994)
Data set broken into overlapping subsequences of length w
Fit a two-component (motif / background) mixture model
iteratively by EM algorithm.
motifs must be same width; number of different motifs
(components) may be unknown
Monte Carlo-based Approaches
Gibbs Motif Sampler (Liu et al, JASA, 95)
Weight matrix Θ (parameter), start positions A (“missing” data)
1 motif per sequence (originally)
Randomly initialize the A_i’s
Iteratively, choose a sequence k to exclude, update A_k
Predictive updating: assume N − 1 sequences aligned; stochastically predict the N-th one
Repeat until convergence
[Figure: sequences with start positions A_1, A_2, A_3, ...; A_N = ??]
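The predictive-updating loop can be illustrated with a minimal site sampler. It is a sketch under stated assumptions, not the published Gibbs Motif Sampler: one motif per sequence, a uniform background, a pseudocount of 0.5, and a systematic sweep over sequences instead of a random choice of k.

```python
import random

ALPHABET = "ACGT"


def column_probs(sites, w, pseudo=0.5):
    """Weight matrix estimated from the currently aligned sites,
    smoothed with a pseudocount (assumed value)."""
    probs = []
    for pos in range(w):
        col = [s[pos] for s in sites]
        total = len(col) + pseudo * len(ALPHABET)
        probs.append({b: (col.count(b) + pseudo) / total for b in ALPHABET})
    return probs


def site_weights(seq, w, theta, background=0.25):
    """Motif/background likelihood ratio for every start position."""
    weights = []
    for start in range(len(seq) - w + 1):
        ratio = 1.0
        for pos in range(w):
            ratio *= theta[pos][seq[start + pos]] / background
        weights.append(ratio)
    return weights


def gibbs_sampler(seqs, w, sweeps=200, rng=None):
    rng = rng or random.Random(0)
    # Randomly initialize the start positions A_i.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Exclude sequence k and build the matrix from the other sites.
            others = [s[a:a + w]
                      for i, (s, a) in enumerate(zip(seqs, starts)) if i != k]
            theta = column_probs(others, w)
            # Stochastically predict the start position in sequence k.
            weights = site_weights(seq, w, theta)
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    seqs = ["GGCTATAATGCG", "TATAATCCGGCA", "ACGGTCTATAAT", "CGTATAATGGCC"]
    print(gibbs_sampler(seqs, w=6))
```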
Extensions of Gibbs Sampling
AlignAce (Roth et al, 1998)
Iterative masking to find multiple motifs
Variable widths of motifs
BioProspector (X. Liu et al, 2001)
3rd order Markovian background assumption
A modified “scoring” function
Allows 2-block (gapped) motifs
“Dictionary” Model (Bussemaker et al. 2000)
[Figure: a pool of words (A, C, G, T, TAT, ACTTA) drawn and concatenated into a sequence]
A, C, G, T and longer words are part of a dictionary.
Words are drawn independently according to certain probabilities and concatenated together to form the sequence.
“Dictionary” Model (Bussemaker et al. 2000)
Algorithm:
Starting dictionary D0: A, C, G, T
Find over-represented pairs (adjacent words)
Concatenate and add them to the dictionary
Estimate word probabilities
e.g. D4: A, C, G, T, AC, GC, TA, ATA, GCTA, TAGCTA
Problems: upper limit to length of patterns; fuzzy patterns; patterns with under-represented substrings
Words → Stochastic Word Matrices
[Figure: the word TTGACA written as a 0/1 matrix (rows A, C, G, T; one column per letter), then replaced by a matrix of column probabilities]
Word probabilities under the stochastic word matrix:
TTGACA [PROB = 0.5417]
ATGACT [PROB = 0.0150]
TTTAGT [PROB = 7.6e-05]
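A stochastic word matrix assigns each word the product of its column probabilities. The sketch below uses a hypothetical matrix `THETA`: its entries are an assumption, chosen so that the three probabilities quoted on the slide are reproduced, and should not be read as the slide's exact matrix.

```python
from math import prod

# Hypothetical 6-column stochastic word matrix (one dict per position).
# Assumption: entries were picked to match the three slide probabilities.
THETA = [
    {"A": 0.20, "C": 0.00, "G": 0.00, "T": 0.80},
    {"A": 0.00, "C": 0.05, "G": 0.00, "T": 0.95},
    {"A": 0.00, "C": 0.00, "G": 0.99, "T": 0.01},
    {"A": 1.00, "C": 0.00, "G": 0.00, "T": 0.00},
    {"A": 0.10, "C": 0.80, "G": 0.10, "T": 0.00},
    {"A": 0.90, "C": 0.00, "G": 0.00, "T": 0.10},
]


def word_probability(word, theta):
    """P(word | theta): product of the column probabilities,
    with columns treated as independent."""
    if len(word) != len(theta):
        raise ValueError("word length must equal the matrix width")
    return prod(col[letter] for col, letter in zip(theta, word))


if __name__ == "__main__":
    for word in ("TTGACA", "ATGACT", "TTTAGT"):
        print(word, word_probability(word, THETA))
```

An ordinary (deterministic) word is the special case in which every column puts probability 1 on a single letter.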
Stochastic Dictionary Framework
Sequence $S = (x_1, x_2, \ldots, x_N)$
Dictionary of D word matrices: $\mathcal{D} = \{M_1, M_2, \ldots, M_D\}$
Word usage probabilities: $\rho = (\rho(M_1), \ldots, \rho(M_D))$
Under independence, observed likelihood of sequence S is
$P(S \mid \rho, \mathcal{D}) = \sum_{\Pi} \prod_{j=1}^{D} \rho(M_j)^{N_j(\Pi)}$
summed over all partitions $\Pi$, where $N_j(\Pi)$ is the frequency of word type j in partition $\Pi$.
PROBLEM: partitions, words, number of words unknown!
Missing Data: Sequence Partitions
Sequence S: AAATATTCGACGCTATTTCCCGTTGACAT
$w_1 = 4$: 2 words ($A_{4,1} = 1$, $A_{14,1} = 1$)
$w_2 = 6$: 1 word ($A_{23,2} = 1$)
Site indicators unknown for motif type k:
$A_{ik} = 1$ if motif type k starts at i, 0 otherwise.
Model Parameters
Dictionary of size D contains the 4 single letters and d stochastic word matrices
$\Theta^{(D)} = \{\Theta_1, \ldots, \Theta_d\}$
$\Theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{i w_i})$, width $w_i$ ($\theta_{ij}$ is a vector of length 4)
[Figure: sequence logo of |TATAAT|, positions 1-6, in bits]
Word usage probabilities $\rho = (\rho_1, \rho_2, \ldots, \rho_D)$
For conciseness, $O = (\Theta, \rho)$
Priors: columns $\theta_{ij}$ are independent Dirichlet; $\rho$ is Dirichlet
Stochastic Dictionary-based Data Augmentation (SDDA)
Sample $(\Theta, \rho) \mid A, S$: Easy
– Word counts multinomial, Dirichlet prior on $\rho$: $\rho \mid A, S \sim$ Dirichlet
– Motif column counts independent multinomial, prior on columns of $\Theta$ independent Dirichlet: $\Theta \mid A, S \sim$ Product Dirichlet
Sample $A \mid S, \Theta, \rho$: Difficult!
Need dynamic programming (forward summation-backward sampling)
Details: Discovery of conserved sequence patterns using a stochastic dictionary model. Gupta, M and Liu, JS (2003). JASA 98, 55-66.
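The "easy" step is a conjugate update: counts plus prior pseudocounts give the Dirichlet posterior parameters. The sketch below draws from those posteriors using only the standard library; the counts, the prior values and the function names are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_rho(word_counts, prior, rng):
    """rho | A, S ~ Dirichlet(word counts + prior pseudocounts)."""
    return sample_dirichlet([c + p for c, p in zip(word_counts, prior)], rng)


def sample_theta(column_counts, prior, rng):
    """Theta | A, S ~ product of independent Dirichlets, one per column.
    column_counts[j] holds the A, C, G, T counts of motif column j."""
    return [sample_dirichlet([c + p for c, p in zip(col, prior)], rng)
            for col in column_counts]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical counts: 4 single letters plus 1 motif word.
    print(sample_rho([120, 80, 90, 110, 12], [1.0] * 5, rng))
    # Hypothetical column counts for a width-3 motif.
    print(sample_theta([[12, 0, 0, 0], [0, 1, 0, 11], [2, 0, 10, 0]],
                       [0.5] * 4, rng))
```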
Forward Summation
Observed data likelihood, with currently D words in the dictionary
Partial likelihood: $g_k(O) = P(x_1 \ldots x_k \mid O)$
Given $g_{k-1}(O), g_{k-2}(O), \ldots$, $g_k(O)$ is easy to calculate! Then $P(S \mid O) = g_N(O)$.
Example: currently 3 words in the dictionary
richardbayesagreatgrandfatherofthomasbayeswas... (k = 45)
$g_{45}(O) = \rho_1 P(\mathrm{WAS} \mid \Theta_1)\, g_{42}(O) + \rho_2 P(\mathrm{ESWAS} \mid \Theta_2)\, g_{40}(O) + \rho_3 P(\mathrm{ASBAYESWAS} \mid \Theta_3)\, g_{35}(O)$
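The forward recursion can be sketched for the simpler case of deterministic words, where P(segment | word) is 1 on an exact match and 0 otherwise; for stochastic word matrices that indicator would be replaced by the product of column probabilities. The toy dictionary below is an assumption for illustration.

```python
def forward_sum(seq, words):
    """Return g, where g[k] is the likelihood of the first k letters,
    summed over all ways of splitting them into dictionary words.

    words maps each word to its usage probability rho.
    """
    g = [0.0] * (len(seq) + 1)
    g[0] = 1.0  # empty prefix
    for k in range(1, len(seq) + 1):
        for word, rho in words.items():
            w = len(word)
            # The last word of the prefix ends at position k.
            if w <= k and seq[k - w:k] == word:
                g[k] += rho * g[k - w]
    return g


if __name__ == "__main__":
    # Toy dictionary (assumed): two single letters and one longer word.
    words = {"a": 0.3, "b": 0.3, "ab": 0.4}
    print(forward_sum("abab", words))
```

The cost is proportional to the sequence length times the number of words, whereas the number of partitions being summed grows exponentially.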
Backward Sampling
Partitions; currently 3 words in the dictionary
richardbayesagreatgrandfatherofthomasbayeswas...
$P(\text{“ASBAYESWAS”} \mid S, O, A_{46}) = \dfrac{\rho_3\, P(\mathrm{ASBAYESWAS} \mid \Theta_3)\, g_{35}(O)}{g_{45}(O)}$
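Backward sampling draws a complete partition by walking from the end of the sequence and choosing the last word with probability proportional to rho times the partial likelihood of the remaining prefix. The sketch below again assumes deterministic words and a toy dictionary that can spell the whole sequence; both are illustrative assumptions.

```python
import random


def forward_sum(seq, words):
    """g[k]: likelihood of the first k letters, summed over partitions."""
    g = [0.0] * (len(seq) + 1)
    g[0] = 1.0
    for k in range(1, len(seq) + 1):
        for word, rho in words.items():
            w = len(word)
            if w <= k and seq[k - w:k] == word:
                g[k] += rho * g[k - w]
    return g


def backward_sample(seq, words, rng):
    """Sample one partition of seq from its posterior distribution.
    Assumes the dictionary can spell seq (g[len(seq)] > 0)."""
    g = forward_sum(seq, words)
    k = len(seq)
    partition = []
    while k > 0:
        # Words that can end at position k, weighted by rho * g[k - w];
        # dividing by g[k] would give the conditional probabilities.
        candidates = [(word, rho * g[k - len(word)])
                      for word, rho in words.items()
                      if len(word) <= k and seq[k - len(word):k] == word]
        chosen = rng.choices([c[0] for c in candidates],
                             weights=[c[1] for c in candidates])[0]
        partition.append(chosen)
        k -= len(chosen)
    return partition[::-1]


if __name__ == "__main__":
    words = {"a": 0.3, "b": 0.3, "ab": 0.4}
    print(backward_sample("abab", words, random.Random(0)))
```

Each call returns an exact joint draw of the whole partition, rather than a one-site-at-a-time update.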
Backward Sampling for Motif Sites
Faster convergence than the Gibbs Sampler:
Gibbs: sample $A_i \mid A_1, \ldots, A_{i-1}, A_{i+1}, \ldots, A_N, S$ ("local" moves)
[Figure: with $A_j = 1$ and $A_k = 1$, for j < i < k a motif can be at i only if i − j ≥ w]
SDDA: sample A from the joint distribution, $A_i \mid A_{i+1}, \ldots, A_N, \Theta, S$ ("global" moves)
– faster convergence than the ordinary Gibbs sampler using conditionals (Liu et al. Biometrika, 1994)
Extensions: Insertions and Deletions in Motifs
GACAC  TATTCC  TATGC   GGC  CTACGACCT
CACA   TATTCC  ACCA    GGC  TGATCTCAGACAC
TCG    TATTCC  GTGGTA  GGC  AGTTTCTCAAAC
TCTGT  TATTCC  ACCCA   GGC  GTAATTCGAGA
GCT    TATTCC  AAA     GGC  TTCTCGAGAAATC
Not all w columns of the motif may be contiguous
W = 6 + 3 = 9, varying gap size
Discovery of Gapped Motifs
Hidden Markov model for gapped motifs
[Figure: HMM with BEGIN and END states and, for each motif column j, a match state $M_j$, an insertion state $I_j$ and a deletion state $D_j$]
Each position of a segment can be generated from 1 of 3 possible states: MATCH, INSERTION, DELETION
$\Theta = (\theta_1, \ldots, \theta_9)$; TATTCCAAAGGC ??
$P(\mathrm{TATTCCAAAGGC} \mid \Theta) = \sum_{\text{all paths through HMM}} P(\mathrm{TATTCCAAAGGC} \mid \Theta, \text{path})$
Multiple paths through the HMM to “align” a gapped segment of length l with matrix Θ (width w)
Approx. order of computation over all paths: $\binom{l}{w}$ (about $2^{2w} / \sqrt{w\pi}$ when l = 2w)
Use recursive forward summation to compute the likelihood and backward sampling to sample a path through the HMM.
Order of computation: $O(lw)$
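To give a feel for why enumerating paths is impractical, the short sketch below compares an exact binomial count with the Stirling-type estimate $2^{2w}/\sqrt{w\pi}$. It assumes the garbled expression on the slide is the binomial coefficient $\binom{l}{w}$ evaluated at l = 2w; that reading is an inference, not something the source states explicitly.

```python
import math


def alignments_exact(l, w):
    """Number of ways to choose which w of l positions are matched."""
    return math.comb(l, w)


def stirling_estimate(w):
    """Approximation to comb(2w, w) from Stirling's formula."""
    return 2 ** (2 * w) / math.sqrt(math.pi * w)


if __name__ == "__main__":
    for w in (6, 9, 16):
        exact = alignments_exact(2 * w, w)
        print(w, exact, round(stirling_estimate(w)), w * 2 * w)
```

The last printed column, l × w, is the size of the dynamic-programming table used by forward summation, which grows only quadratically in w when l = 2w.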
Motifs in “Low-complexity” Regions
Genomes of higher organisms are often characterized by polynucleotide repeats: local “traps”
Example
>gi|200712|gb|M97810.1|MUSREX01 Mus musculus ZFP 42 (Rex-1) gene
TCAGGCAACTAGTGTACTTTGTAGCGGGGTCCGGGAGAGGCTGGGGTCTAGAGTGGCGATGGGACGAAAGGGTAAA
AGTTTTCGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGGTAGGTAGGTAGGTAGAAAC
ATCCTCTGCTTGTGTAAATCCGGTTACTGTGTAACAGAGGTACTGAGATGTGACTGAGTCTCAAGGCCAGGCGATC
GGATTCAGAAGAGGCATTTGCATAACTGAGCAAGAGCCTTTGCCCCACCCTTCCACGCGGACCCAAGACAAGCGGG
TCTGGGTGGGTCACCTTGAAGCCAGGGGCCCGCCCACATCCCCGCCCACACCCACCTTGAGCGCTTCTCATTGGTT
CAGCCTACCCAGCTCGGAGTTAGTTACTCCGTAAGTGTGGCCGGAACAGAGTTCGTCCATCTA...
[mouse skeletal muscle regulatory region]
Compared performance of Stochastic Dictionary (SDDA) to
BioProspector and AlignAce
Simulation Study under Sequence Background Correlation
Markov transition matrices used:
(i) 1-letter repeats: diagonal entries $1 - 3\alpha$, off-diagonal entries $\alpha$
(ii) 2-letter repeats: entries in $\alpha$ and $\beta$
Degree of dependence represented by the 2nd highest eigenvalue (EV2)
Simulated data sets of 2000 nucleotides each, containing 1 true motif, w=16, and correlated background for 4 choices each of matrix (i) and (ii).
Performance Comparison
Matrix      EV2            SDDA            BioProspector   AlignAce
type        (dependence)   Success  FP     Success  FP     Success  FP
1st order   0.24           1.0      0.07   1.0      0.02   0.3      0.43
            0.48           0.6      0.06   0.7      0.00   0.0      -
            0.72           0.5      0.12   0.1      0.00   0.0      -
            0.96           0.7      0.02   0.0      -      0.0      -
2nd order   0.24           1.0      0.03   1.0      0.09   0.1      0.52
            0.48           0.9      0.12   0.7      0.01   0.1      0.62
            0.72           0.9      0.05   0.6      0.00   0.1      0.36
            0.96           1.0      0.03   0.0      -      0        -
Success: true motif among top 5 found
Criterion: at least 60% of true positives, at least 80% site overlap
Progressive updating in stochastic dictionary provides a control: treats repeats as adjacent words of the same pattern
Model Selection
Evaluating Motif Significance
Back to the problem ... How many patterns to include in the model?
Bayes Factor for comparing model M1 (1 pattern and background) to M0 (background only) is
$\mathrm{BF} = \dfrac{p(S \mid M_1)}{p(S \mid M_0)} = \dfrac{\int \sum_A p(A, S, O \mid M_1)\, dO}{\int p(S, O \mid M_0)\, dO}$, with $O = (\rho, \Theta)$
Integration can be done, but the sum runs over all possible partitions of the sequences!
Computational Approaches to BF
Bayes Factor: $\mathrm{BF} = \dfrac{\sum_A p(A, S \mid M_1)}{p(S \mid M_0)} = \dfrac{c_1}{c_0}$, with $q(A \mid S, M_1)$ the unnormalized density
Approaches: importance sampling, bridge sampling, marginalization ...
Can get draws from $p(A \mid S, O, M_1)$ and estimate $\hat{c}_1$ from the sample averages $\frac{1}{N_2} \sum_{i=1}^{N_2} q(A^{(2i)})$ and $\frac{1}{N_1} \sum_{i=1}^{N_1} p(A^{(1i)})$: biased estimate
But ... difficult to get “correct” samples from $q(A \mid S, M_1)$
Heuristics: use mixture of densities as trial distribution
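Analytical Approximation: MAP Score
Maximal à posteriori score, evaluated at the “optimal” alignment $A^*$:
$\mathrm{MAP}(A^*) = \dfrac{\max_A P(S, A \mid M_1)}{P(S \mid M_0)} = \dfrac{P(S, A^* \mid M_1)}{P(S \mid M_0)}$
Note: $\sum_A P(S, A \mid M_1) \ge P(S, A^* \mid M_1)$
So $\mathrm{BF} \ge \mathrm{MAP}(A^*)$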
[Figure: log(MAP) in Random Sequences for Varying w; x-axis: w (6 to 16), y-axis: log(MAP) (−10 to 2)]
Horizontal line corresponds to alignment with no motifs
Divergence of the MAP Score
Result 1
As sequence length $N \to \infty$, $\mathrm{MAP}(A^*) \to \infty$ in probability, at an exponential rate in N, if model M1 is true (under certain simple conditions)
Result 2
MAP monotonically increases with increasing number of motifs if the larger model is true
MAP as stopping criterion for number of words to include in the dictionary; more useful as word length increases
Future Directions
Refine clustering model
More appropriate inter-motif distance models
Model selection criteria
Incorporate more biological knowledge
Correlated sequences (phylogenetic information)
Intra-motif correlations?
Broaden scope of the problem
Biological databases, knowledge growing at a rapid pace
Discovery of regulatory networks, relationships among genes
through study of regulatory motifs
References
Previous Work
Gibbs Motif Sampler: Liu, JS, Neuwald, AF and Lawrence, CE (1995). JASA 90, 1156-70.
BioProspector: Liu, XS, Brutlag, D and Liu, JS (2001). P. Symp. Biocomp. 6, 127-38.
Dictionary: Bussemaker, HJ, Li, H and Siggia, ED (2000). PNAS 97, 10096-10100.
For this work
Discovery of conserved sequence patterns using a stochastic dictionary model. Gupta, M and Liu, JS (2003). JASA 98, 55-66.
Statistical models for biological sequence motif discovery. Liu, JS, Gupta, M, Liu, X, Mayerhofer, L and Lawrence, CE (2002). Case Studies in Bayesian Statistics 6.
Motif cluster prediction using a regulatory module framework. In preparation.