Sequential Data Mining - Computer Science

advertisement
Sequential Data Mining
Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia
2011
Sequential Data
• Sequence data: a series of alphabets or numbers
• Spatial sequence data: biological sequence, text
document, a computer program, web page, a
missile trajectory
• Temporal sequence data: speech, movie clips,
stock prices, a series of web clicks
Key Issue
• Measure the similarity between two
sequences, i.e. sequence comparison
• Sequence database search / retrieval
• Sequence classification
• Sequence clustering
• Sequence pattern recognition
• Sequence similarity definition and calculation
is a challenge compared to vector-based data
Sequence Alignment
•
•
•
•
Use biological sequences as an example
Optimal pairwise sequence alignment algorithm
Fast, heuristic sequence alignment (BLAST)
Statistical significance of sequence alignment
score
Sequence Data
• DNA Sequence (gene)
gaagagagcttcaggtttggggaagagaca
acaactcccgctcagaagcaggggccgata
• RNA Sequence
caaaucacuacacacaggguagaagguggaacgcacaggagcaugu
caacggggugc
• Protein Sequence
RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPL
VTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP
Global Pairwise Sequence Alignment
ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL
ITAKPQWLKTSESVTFLSFLLPQTQGLYHL
Global Pairwise Sequence Alignment
ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL
ITAKPQWLKTSESVTFLSFLLPQTQGLYHL
ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL
ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
Three Main Issues
1.Definition of alignment score
2.Algorithms of finding the optimal alignment
3.Evaluation of significance of alignment score
A Simple Scoring Scheme
• Score of a character pair: S(match)=1,
S(not_match) = -1, S(gap-char) = -1
n
• Score of an alignment =
S
i
1
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL
5 – 7 – 7 +10 -4 + 4= 1
Optimization Problem
• How many possible alignments exist for two
sequences with length m and n?
• How can we find the best alignment to
maximize alignment score?
Possible Alignments
Shortest
ACTG--ATGGATG
ACTG
ATGGATG
Intermediate
AC-TG--ATG-GATG
Longest
ACTG----------ATGGATG
Total Number of Possible Alignments
AGATCAGAAAT-G
--AT-AG-AATCC
A G A A T T C A A G GA A A A A T T C G C
m+n
Total Number of Alignments
Select m positions out of m+n possible positions:
 m  n  (m  n)!

 
m!n!
m 
Exponential!
If m = 300, n = 300, total = 1037
Needleman and Wunsch Algorithm
• Given sequences P and Q, we use a matrix M to record the
optimal alignment scores of all prefixes of P and Q. M[i,j] is
the best alignment score for the prefixes P[1..i] and Q[1..j].
• M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
Needleman and Wunsch Algorithm
• Given sequences P and Q, we use a matrix M to record the
optimal alignment scores of all prefixes of P and Q. M[i,j] is
the best alignment score for the prefixes P[1..i] and Q[1..j].
• M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
Dynamic Programming
Dynamic Programming Algorithm
Three-Step Algorithm:
•Initialization
•Matrix filling (scoring)
•Trace back (alignment)
1. Initialization of Matrix M
_ A
T
_
0
i
A
G
A
-1
T
C
A
G
-4
A
A
A
-8
T
G
-11
-2
-3
-5
-6
-7
-9
-10
-12
-1
-2
j
A
-3
G
-4
A
-5
A
-6
T
-7
2. Fill Matrix
_
_
0
A
-1
A
G
A
-1
1
-2
0
-3
-1
T
C
A
G
-4
T
-2
A
-3
G
-4
A
-5
A
-6
T
-7
-2
-5
-3
-6
-4
-7
-5
A
A
A
-8
-6
-9
-7
-10
-8
T
G
-11
-9
-12
-10
M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
2. Fill Matrix
_
_
0
A
-1
T
-2
A
G
A
T
C
A
G
A
A
A
-1
1
0
-2
0
0
-3
-1
-1
T
G
-4
-2
A
-3
G
-4
A
-5
A
-6
T
-7
0
-5
-3
-1
-6
-4
-2
-7
-5
-3
-8
-6
-4
-9
-7
-5
-10
-8
-6
-11
-9
-7
-12
-10
-8
M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
2. Fill Matrix
_
_
0
A
-1
T
-2
A
-3
G
A
A
T
-4
-5
-6
-7
A
G
A
-1
1
0
-1
-2
-3
-4
-5
-2
0
0
-1
0
-1
-2
-3
-3
-1
-1
1
-1
1
0
-1
T
C
A
G
A
A
A
-4
0
0
0
0
0
1
T
G
-2
-5
-3
-1
-1
-1
-1
-1
0
-6
-4
-2
0
-1
0
0
-1
-7
-5
-3
-1
1
0
-1
-1
-8
-6
-4
-2
0
2
1
0
-9
-7
-5
-3
-1
1
3
2
-10
-8
-6
-4
-2
0
2
2
-11
-9
-7
-5
-3
-1
1
3
-12
-10
-8
-6
-4
-2
0
2
M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
3. Trace Back
_
_
A
G
A
T
C
A
AGATCAGAAATG
G
--AT-AG-AATA
A
A
T
G
0
A
T
A
-1
-2
-3
G
A
A
T
-4
-5
-6
-7
-1
1
0
-1
-2
-3
-4
-5
-2
0
0
-1
0
-1
-2
-3
-3
-1
-1
1
-1
1
0
-1
-4
-2
0
0
0
0
0
1
-5
-3
-1
-1
-1
-1
-1
0
-6
-4
-2
0
-1
0
0
-1
-7
-5
-3
-1
1
0
-1
-1
-8
-6
-4
-2
0
2
1
0
-9
-7
-5
-3
-1
1
3
2
-10
-8
-6
-4
-2
0
2
2
-11
-9
-7
-5
-3
-1
1
3
-12
-10
-8
-6
-4
-2
0
2
Pairwise Alignment Algorithm Using Dynamic
Programming
• Initialization: Given two sequences with length m and n,
create a (m+1)×(n+1) matrix M. Initialize the first row and first
column according to scoring matrix.
• For j in 1..n (column)
for i in 1..m (row)
M[i,j] = max( (M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i1, j] + S(i,-) )
Record the selected path toward (i,j)
• Report alignment score M[m,n] and trace back to M[0,0] to
generate the optimal alignment.
What is the time complexity of this algorithm?
Local Sequence Alignment Using DP
• Biological sequences often only have local similarity.
• During evolution, only functional and structural
important regions are highly conserved.
• Global alignment sacrifices the local similarity to
maximize the global alignment score.
• Need to use alignment method to identify the local
similar regions disregard of other dissimilar regions.
Transcription binding site
Local Alignment Algorithm
Goal: find an alignment of the substrings of P and Q with
maximum alignment score.
Naïve Algorithm:
(m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q
Using DP for each substring pairs:
m2 * n2 * O(mn) = O(m3n3)
Smith-Waterman Algorithm
Same dynamic programming algorithm as global
alignment except for three differences.
1. All negative scores is converted to 0 (why?)
2. Alignment can start from anywhere in the
matrix
3. Alignment can end at anywhere in the
matrix
Local Alignment Algorithm
• Initialization: Given two sequences with length m and n,
create a (m+1)×(n+1) matrix M. Initialize the first row and first
column to 0s.
• For j in 1..n (column)
for i in 1..m (row)
M[i,j] = max( 0, M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j),
M[i-1, j] + S(i,-) )
Record the selected path.
• Find elements in matrix M with maximum values. Trace back
till 0 and report the alignment corresponding to the path.
1. Initialization
_
_
0
A
0
G
0
A
0
T
0
C
0
A
0
G
0
A
0
A
0
A
0
T
0
G
0
A
0
T
0
A
0
G
0
A
0
A
0
T
0
2. Fill the Matrix
_
A
T
A
G
A
A
T
_
0
0
0
0
0
0
0
0
A
0
1
0
1
0
1
1
0
G
0
0
0
0
2
1
0
0
A
0
1
0
1
0
3
2
1
T
0
0
2
1
0
2
2
3
C
0
0
1
0
0
1
1
2
A
0
1
0
2
1
1
2
1
G
0
0
0
1
3
2
1
1
A
0
1
0
1
2
4
3
2
A
0
1
0
1
1
3
5
4
A
0
1
0
1
0
2
4
4
T
0
0
2
1
0
1
3
5
G
0
0
1
1
2
1
2
4
3. Trace Back
_
A
T
A
G
A
A
T
_
0
0
0
0
0
0
0
0
A
0
1
0
1
0
1
1
0
G
0
0
0
0
2
1
0
0
A
0
1
0
1
0
3
2
1
T
0
0
2
1
0
2
2
3
C
0
0
1
0
0
1
1
2
A
0
1
0
2
1
1
2
1
G
0
0
0
1
3
2
1
1
A
0
1
0
1
2
4
3
2
A
0
1
0
1
1
3
5
4
A
0
1
0
1
0
2
4
4
T
0
0
2
1
0
1
3
5
G
0
0
1
1
2
1
2
4
Two Best Local Alignments
Local Alignment 1:
ATCAGAA
AT-AGAA
Local Alignment 2:
ATCAGAAAT
AT-AGA-AT
Scoring Matrix
• How to accurately measure the similarity
between amino acids (or nucleotides) is one
key issue of sequence alignment.
• For nucleotides, a simple identical / not
identical scheme is mostly ok.
• Due to various properties of amino acids, it is
hard and also critical to measure the similarity
between amino acids.
Evolutionary Substitution Approach
• During evolution, the substitution of similar
amino acids is more likely to be selected
within a protein family of similar sequences
than random substitutions (M. Dayhoff)
• The frequency / probability that one residue
substitutes another one is an indicator of their
similarity.
PAM Scoring Matrices
(M. Dayhoff)
• Select a number of protein families.
• Align sequences in each family and count the
frequency of amino acid substitution in each
column: Pij.
• Similarity score is logarithm of the ratio of
observed substitution probability over the
random substitution probability. S(i,j) = log(Pij /
(Pi * Pj)). Pi is the observed probability of residue
i and Pj is the observed probability of residue j
• PAM: Point Accepted Mutation
• Let data tell strategy! - NBA
A Simplified Example
ACGTCGAGT
ACCACGTGT
CACACTACT
ACCGCATGA
ACCCTATCT
TCCGTAACA
ACCATAAGT
AGCATAAGT
ACTATAAGT
ACGATAAGT
Chars
A
C
G
T
Prob.
6 / 10
1 / 10
2 / 10
1 / 10
P(A<->C) = 0.07+0.07=0.14
Substitution Frequency Table
A
C
G
T
A
30
6
12
6
C
6
0
2
1
G
12
2
1
2
T
6
1
2
0
Total number of substitutions: 90
A
C
G
T
A
.33
.07
.14
.07
C
.07
0
.02
.01
G
.14
.02
.01
.02
T
.07
.01
.02
0
A Simplified Example
ACGTCGAGT
ACCACGTGT
CACACTACT
ACCGCATGA
ACCCTATCT
TCCGTAACA
ACCATAAGT
AGCATAAGT
ACTATAAGT
ACGATAAGT
Chars
A
C
G
T
Prob.
6 / 10
1 / 10
2 / 10
1 / 10
Substitution Frequency Table
A
C
G
T
A
30
6
12
6
C
6
0
2
1
G
12
2
2
2
T
6
1
2
0
Total number of substitutions: 90
A
C
P(A<->C) = 0.07+0.07=0.14
S(A,C) = log(0.14/(0.6*0.1)) = 0.36 G
T
A
.33
.07
.14
.07
C
.07
0
.02
.01
G
.14
.02
.01
.02
T
.07
.01
.02
0
PAM250 Matrix (log odds multiplied by 10)
BLOSUM Matrices
(Henikoff and Henikoff)
• PAM calculation is based on global alignments
• PAM matrices don’t work well for aligning
sequences with little similarity.
• BLOSUM: BLOcks SUbstitution Matrix
• BLOSUM based on highly conserved local
regions /blocks without gaps.
Block 1
Block2
BLOSUM62 Matrix
Significance of Sequence Alignment Score
• Why do we need significant test?
• Mathematical view: unusual versus “by
chance”
• Biological view: evolutionary related or not?
Randomization Approach
• Randomization is a fundamental idea due to
a statistician Fisher.
• Randomly permute chars within sequence P
and Q to generate new sequences (P’ and
Q’). Align random sequences and record
alignment scores.
• Assuming these scores obey normal
distribution, compute mean (u) and standard
derivation (σ) of alignment scores
95%
Alignment score
Normal distribution of alignment scores of two sequences
•If S = u+2 σ, the probability of observing the alignment score equal to or more
extreme than this by chance is 2.5%, e.g., P(S>=u+2 σ) = 2.5%.
Thus we are 97.5% confident that the alignment score is not by chance, i.e. significant.
•For any score x, we can compute P(S >= x), which is called p-value.
Model-Based Approach
(Karlin and Altschul)
http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html
• Extreme Value
Distribution
K and λ are statistical parameters depending
on substitution matrix. For BLOSUM62, λ =0.252, K=0.35
P-Value
• P(S≥x) is called p-value. It is the probability
that two random sequences has an alignment
score >= x.
• Smaller means more significant.
• E-value = p-value * # sequences
Problems of Using Dynamic Programming to
Search Large Sequence Databases
• Search homologs in DNA and protein database is
often the first step of a bioinformatics study.
• DP is too slow for large sequence database search
such as Genbank and UniProt. Each DP search can
take hours.
• Most DP search time is wasted on unrelated
sequences or dissimilar regions.
• Developing fast, practical sequence comparison
methods for database search is important.
Fast Sequence Search Methods
• All successful, rapid sequence comparison methods
are based on a simple fact: similar sequences share
some common words.
• First such method is FASTP (Pearson & Lipman, 1985)
• Most widely used methods are BLAST (Altschul et al., 1990)
and PSI-BLAST (Altschul et al., 1997).
Basic Local Alignment Search Tool
(S. Altschul, W. Gish, W. Miller, E. Meyer and D. Lipman)
1. Compile a list of words for
a query
2. Scan sequences in
database for word hits
3. Extending hits
David Lipman
Stephen Altschul
Step 1: Compile Word List
• Words: w-mer with length w.
• Protein 4-mer and DNA 12-mer
Query:
DSRSKGEPRDSGTLQSQEAKAVKKTSLFE
Words: DSRS, SRSK, RSKG, KGEP….
Step 2: Scan Database
Classic problem: find occurrence of a
list of words in a sequence.
•Integer indexing approach (hashing)
•Deterministic finite automaton or finite state machine (faster)
Finite State Machine
• A finite state machine for string matching:
(input alphabet: a,b,c)
Word: abab
a
a
a
0 a 1 b 2 a 3 b 4
Database sequence: bcabccaaababacababacabb
Step 3: Extension
• Extend words on both ends
• Terminate the process when we reach a segment pair
whose score falls a certain distance below the best
score found for shorter extensions.
• Depart from the ideal of finding a guaranteed
Maximum Segment Pair, but the added inaccuracy is
negligible.
• Report significant MSP according to extreme value
distribution
An Example of Extension
Query:
DSRSKGEPRDSGTLQSQEAKAVKKTSLFE
Words: DSRS, SRSK, RSKG, KGEP….
Database Sequence: PESRSKGEPRDSGKKQMDSOKPD
Maximum Segment Pair: ESRSKGEPRDSG
P-Value and E-Value
• P-value
• E-value = database
size * p-value
• Common threshold:
0.01
S
P-value = Prob(score >=S)
Why is BLAST so Successful?
•
•
•
•
Address a fundamental problem
Extremely fast
Simple, yet powerful idea
Well founded in computer theory (wordsstring matching, hashing, random process of
Maximum Segment Pair)
• Implementation tricks -> super speed software
• Sacrifice a little accuracy for speed practically
(good heuristics)  Realism
NCBI Online Blast
DNA Blast
Protein
Blast
Matched sequences ranked by score and evalue
Output
Format
Significant local alignments
Application of Pairwise Sequence
Comparison
•
•
•
•
Database search
Clustering
Classification
Motif recognition
Download