Bioinformatics

Definitions
Optimal alignment - one that exhibits the
most correspondences. It is the alignment
with the highest score. May or may not
be biologically meaningful.
Global alignment - Needleman-Wunsch
(1970) maximizes the number of matches
between the sequences along the entire
length of the sequences.
Local alignment - Smith-Waterman (1981)
gives the highest scoring local match
between two sequences.
Pairwise Global Alignment

Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the
sequences along the entire length of the sequences.

Reasons for making a global alignment:
- Checking minor differences between two sequences
- Analyzing polymorphisms (e.g., SNPs) between closely related sequences
- …
Pairwise Global Alignment

Computationally:
 Given: a pair of sequences (strings of characters)
 Output: an alignment that maximizes the similarity

How can we find an optimal alignment?
1                        27
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
How many possible alignments?
C(27,7) gap positions = ~888,000 possibilities
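
The count quoted above can be checked with a one-line calculation. This is a minimal sketch that assumes, as the slide does, that the only freedom is which 7 of the 27 columns hold a gap in the shorter sequence; the variable name is illustrative.

```python
import math

# Choose which 7 of the 27 alignment columns contain a gap
# in the shorter (20-residue) sequence.
gap_placements = math.comb(27, 7)
print(gap_placements)  # 888030
```

Even for this single fixed alignment length, the count is close to a million, which is why enumerating the candidates one by one does not scale.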
Dynamic programming: The Needleman & Wunsch algorithm
Time Complexity
Consider two sequences:
AAGT
AGTC
How many possible alignments do the two sequences have?
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\left(\frac{2^{2n}}{\sqrt{n}}\right) = \Omega(2^n)$
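
To get a feel for how fast this count grows, the sketch below evaluates the central binomial coefficient for small n and compares it with the Stirling-based estimate 4^n / sqrt(pi n). The function names are illustrative and not part of the slides.

```python
import math

def central_binomial(n):
    """Exact value of C(2n, n) = (2n)! / (n!)^2."""
    return math.comb(2 * n, n)

def stirling_estimate(n):
    """Asymptotic estimate 2^(2n) / sqrt(pi * n) from Stirling's formula."""
    return 4 ** n / math.sqrt(math.pi * n)

for n in (4, 10, 20, 50):
    exact = central_binomial(n)
    print(n, exact, exact / stirling_estimate(n))
```

For the two length-4 sequences above (n = 4) the count is already 70, and it roughly quadruples every time n increases by one.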
Scoring a sequence alignment

Match/mismatch score: +1/+0
Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (-2)
Extension: 5 × (-1)
Score = +9
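
The tally above can be reproduced column by column. The sketch below assumes the convention implied by the slide's arithmetic: the first column of each gap run pays the open penalty and every further column of the same run pays the extension penalty. The function name and parameter names are illustrative.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a gapped pairwise alignment, one column at a time."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap_row = None  # which row held the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # First column of a gap run pays the open penalty,
            # later columns of the same run pay the extension penalty.
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score

print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))  # 9
```

Other tools define the open penalty differently (for example, charging open plus extend for the first gap column), so the same alignment can score differently under another convention.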
Needleman & Wunsch

- Place each sequence along one axis
- Place score 0 at the upper-left corner
- Fill in the 1st row and column with multiples of the gap penalty
- Fill in each remaining cell with the max value of 3 possible moves:
  - Vertical move: score of the cell above + gap penalty
  - Horizontal move: score of the cell to the left + gap penalty
  - Diagonal move: score of the upper-left cell + match/mismatch score
- The optimal alignment score is in the lower-right corner
- To reconstruct the optimal alignment, trace back where the max at each step came from; stop when the origin is reached.
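
The steps above can be sketched in a few lines of Python. This is a minimal illustration with a linear gap penalty, not a reference implementation: the function name is illustrative, `s` is placed along the columns and `t` along the rows, and the order in which ties are broken during traceback (diagonal, then vertical, then horizontal) is an assumption.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s (columns) and t (rows) with a linear gap penalty."""
    rows, cols = len(t) + 1, len(s) + 1
    F = [[0] * cols for _ in range(rows)]
    # First row and column: multiples of the gap penalty.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    # Each remaining cell: max of the diagonal, vertical and horizontal moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if t[i - 1] == s[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback from the lower-right corner to the origin.
    i, j = len(t), len(s)
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and t[i - 1] == s[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            top.append(s[j - 1]); bottom.append(t[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(t[i - 1]); i -= 1
        else:
            top.append(s[j - 1]); bottom.append("-"); j -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))

F, top, bottom = needleman_wunsch("AAAC", "AGC")
print(F[-1][-1])  # -1
print(top)        # AAAC
print(bottom)     # -AGC
```

A different tie-breaking order would return another alignment with the same optimal score.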
Example

Let gap = -2, match = 1, mismatch = -1.

      empty     A     A     A     C
empty     0    -2    -4    -6    -8
A        -2     1    -1    -3    -5
G        -4    -1     0    -2    -4
C        -6    -3    -2    -1    -1

Optimal alignments:

AAAC
A-GC

AAAC
-AGC
Time Complexity Analysis

- Initialize matrix values: O(n), O(m)
- Filling in rest of matrix: O(nm)
- Traceback: O(n+m)
- If strings are same length, total time O(n^2)
Local Alignment

Problem first formulated: Smith and Waterman (1981)
Problem: Find an optimal alignment between a substring of s and a substring of t
Algorithm: a variant of the basic algorithm for global alignment
Motivation

- Searching for unknown domains or motifs within proteins from different families
- Proteins encoded by Homeobox genes (conserved only in one region, the homeodomain, 60 amino acids long)
- Identifying active sites of enzymes
- Comparing long stretches of anonymous DNA
- Querying databases where the query word is much smaller than the sequences in the database
- Analyzing repeated elements within a single sequence
Local Alignment

Let gap = -2, match = 1, mismatch = -1.

Sequences:
GATCACCT
GATACCC

      empty     G     A     T     A     C     C     C
empty     0     0     0     0     0     0     0     0
G         0     1     0     0     0     0     0     0
A         0     0     2     0     1     0     0     0
T         0     0     0     3     1     0     0     0
C         0     0     0     1     2     2     1     1
A         0     0     1     0     2     1     1     0
C         0     0     0     0     0     3     2     2
C         0     0     0     0     0     1     4     3
T         0     0     0     1     0     0     2     3

Alignment:
GATCACCT
GAT-ACCC
Smith & Waterman

- Place each sequence along one axis
- Place score 0 at the upper-left corner
- Fill in the 1st row and column with 0s
- Fill in each remaining cell with the max of 4 possible values:
  - 0
  - Vertical move: score of the cell above + gap penalty
  - Horizontal move: score of the cell to the left + gap penalty
  - Diagonal move: score of the upper-left cell + match/mismatch score
- The optimal alignment score is the max in the matrix
- To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is reached
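
The procedure above can be sketched as follows. This is a minimal illustration with a linear gap penalty; the function name is illustrative, `s` runs along the rows and `t` along the columns, and both the choice of the first maximum found and the traceback tie-breaking order (diagonal, then vertical, then horizontal) are assumptions.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Best local alignment of s (rows) and t (columns) with a linear gap penalty."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Max of 0 and the three moves.
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the matrix maximum until a zero is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

print(smith_waterman("GATCACCT", "GATACCC"))  # (4, 'GATCACC', 'GAT-ACC')
```

The only differences from the global version are the zero floor in each cell, the zero-filled first row and column, and the start and stop points of the traceback.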
Exercise

Let gap = -2, match = 1, mismatch = -1.

Find the best local alignment:
CGATG
AAATGGA
Semi-global Alignment
Example:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG

We like the first alignment much better. In semi-global comparison, we score the alignments ignoring some of the end spaces.
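
To see why end spaces matter, the sketch below scores the two alignments above with and without charging for end spaces. The slide does not state a scoring scheme for this example, so match = 1, mismatch = -1, gap = -2 (the values used elsewhere in these slides) are assumed; the function name and the `free_end_gaps` flag are illustrative.

```python
def score_columns(top, bottom, match=1, mismatch=-1, gap=-2, free_end_gaps=False):
    """Score an alignment; optionally ignore spaces before/after either sequence."""
    def residue_span(row):
        # Index of the first and last non-gap character in a row.
        idx = [k for k, c in enumerate(row) if c != "-"]
        return idx[0], idx[-1]

    span_top, span_bottom = residue_span(top), residue_span(bottom)
    score = 0
    for k, (a, b) in enumerate(zip(top, bottom)):
        if a == "-" or b == "-":
            first, last = span_top if a == "-" else span_bottom
            is_end_space = k < first or k > last
            if not (free_end_gaps and is_end_space):
                score += gap
        else:
            score += match if a == b else mismatch
    return score

first = ("CAGCA-CTTGGATTCTCGG", "---CAGCGTGG--------")
second = ("CAGCACTTGGATTCTCGG", "CAGC----G--T----GG")
for name, (top, bottom) in (("first", first), ("second", second)):
    print(name,
          score_columns(top, bottom),
          score_columns(top, bottom, free_end_gaps=True))
# first -19 3
# second -12 -12
```

Under these assumed scores, global scoring ranks the second alignment higher, while ignoring end spaces ranks the first one higher, which matches the preference stated above.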
Global Alignment
Example:
AAACCC
A--CCC

      empty     A     A     A     C     C     C
empty     0    -2    -4    -6    -8   -10   -12
A        -2     1    -1    -3    -5    -7    -9
C        -4    -1     0    -2    -2    -4    -6
C        -6    -3    -2    -1    -1    -1    -3
C        -8    -5    -4    -3     0     0     0

Prefer to see:
AAACCC
--ACCC

Do not want to penalize the end spaces.
SemiGlobal Alignment
Example:
s = AAACCC
t =   ACCC

      empty     A     C     C     C
empty     0    -2    -4    -6    -8
A         0     1    -1    -3    -5
A         0     1     0    -2    -4
A         0     1     0    -1    -3
C         0    -1     2     1     0
C         0    -1     0     3     2
C         0    -1     0     1     4
SemiGlobal Alignment
Example:
s = AAACCCG
t =   ACCC

      empty     A     C     C     C
empty     0    -2    -4    -6    -8
A         0     1    -1    -3    -5
A         0     1     0    -2    -4
A         0     1     0    -1    -3
C         0    -1     2     1     0
C         0    -1     0     3     2
C         0    -1     0     1     4
G         0    -1    -2    -1     2
SemiGlobal Alignment

Summary of end space charging procedures:

Place where spaces are not penalized    Action
Beginning of 1st sequence               Initialize 1st row with zeros
End of 1st sequence                     Look for max in last row
Beginning of 2nd sequence               Initialize 1st column with zeros
End of 2nd sequence                     Look for max in last column
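
The four rules in the summary can be expressed as four switches on the global algorithm. The sketch below is a score-only illustration with a linear gap penalty; the function name and flag names are illustrative, and it assumes the 1st sequence `s` runs along the rows and the 2nd sequence `t` along the columns, as in the examples above.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, gap=-2,
                     free_begin_s=False, free_end_s=False,
                     free_begin_t=False, free_end_t=False):
    """Semi-global alignment score of s (rows) against t (columns)."""
    m, n = len(s), len(t)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Free spaces at the beginning of s: 1st row of zeros.
        F[0][j] = 0 if free_begin_s else j * gap
    for i in range(1, m + 1):
        # Free spaces at the beginning of t: 1st column of zeros.
        F[i][0] = 0 if free_begin_t else i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    candidates = [F[m][n]]
    if free_end_s:
        candidates.extend(F[m])                 # max in last row
    if free_end_t:
        candidates.extend(row[n] for row in F)  # max in last column
    return max(candidates), F

print(semiglobal_score("AAACCC", "ACCC")[0])                       # 0 (global)
print(semiglobal_score("AAACCC", "ACCC", free_begin_t=True)[0])    # 4
print(semiglobal_score("AAACCCG", "ACCC", free_begin_t=True)[0])   # 2
print(semiglobal_score("AAACCCG", "ACCC",
                       free_begin_t=True, free_end_t=True)[0])     # 4
```

With all four flags off this reduces to the global score; the trailing G in the last example only stops costing anything once spaces at the end of the 2nd sequence are also free.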
Pairwise Sequence Comparison over Internet
lalign        www.ch.embnet.org/software/LALIGN_form.html          Global/Local
lalign        fasta.bioch.virginia.edu/fasta_www/plalign.htm       Global/Local
USC           www-hto.usc.edu/software/seqaln/seqaln-query.html    Global/Local
alion         fold.stanford.edu/alion                              Global/Local
              genome.cs.mtu.edu/align.html                         Global/Local
align         www.ebi.ac.uk/emboss/align                           Global/Local
xenAliTwo     www.soe.ucsc.edu/~kent/xenoAli/xenAliTwo.html        Local for DNA
blast2seqs    www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html           Local BLAST
blast2seqs    web.umassmed.edu/cgi-bin/BLAST/blast2seqs            Local BLAST
lalnview      www.expasy.ch/tools/sim-prot.html                    Visualization
prss          www.ch.embnet.org/software/PRSS_form.html            Evaluation
prss          Fasta.bioch.virginia.edu/fasta/prss.htm              Evaluation
graph-align   Darwin.nmsu.edu/cgi-bin/graph_align.cgi              Evaluation
Bioinformatics for Dummies
Significance of Sequence Alignment

Consider randomly generated sequences.
What distribution do you think the best local alignment score of two sequences of sample length should follow?
1. Uniform distribution
2. Normal distribution
3. Binomial distribution (n Bernoulli trials)
4. Poisson distribution (n → ∞, np = λ)
5. Others
Extreme Value Distribution

$Y_{ev} = \exp(-x - e^{-x})$

[Figure: extreme value distribution density; x-axis from -5 to 5, y-axis from 0 to 0.4]
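
The density formula above is easy to evaluate numerically. The sketch below computes it next to the standard normal density for comparison; the function names are illustrative, and the standard normal (mean 0, variance 1) is an assumed choice of reference curve.

```python
import math

def extreme_value_pdf(x):
    """Density exp(-x - e^(-x)) of the standard extreme value distribution."""
    return math.exp(-x - math.exp(-x))

def standard_normal_pdf(x):
    """Density of the normal distribution with mean 0 and variance 1."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for x in (-3, -1, 0, 1, 2, 4):
    print(x, round(extreme_value_pdf(x), 4), round(standard_normal_pdf(x), 4))
```

The extreme value density peaks at x = 0 with height 1/e (about 0.37) and is asymmetric: it drops off much faster than the normal density on the left and much more slowly on the right, so unusually high scores are less surprising than a normal model would suggest.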
Extreme Value Distribution vs. Normal Distribution

[Figure: two density plots, extreme value and normal; x-axes from -5 to 5, y-axes from 0 to 0.4]
“Twilight Zone”

[Figure: density plot; x-axis from -5 to 5, y-axis from 0 to 0.4]

Some proteins with less than 15% similarity have exactly the same 3-D structure, while some proteins with 20% similarity have different structures. Homology or non-homology can never be taken for granted in the twilight zone.