Uploaded by rajpot.twinkle1984

Alignment-Bioinfor

Terms
• Blocks: ungapped patterns of amino acids (AAs) that are present in related proteins.
• Contig: an assembled set of overlapping DNA sequence fragments.
• Domain: a portion of a given protein sequence.
• E value: a significance parameter (the expected number of equally good matches arising by chance) used by search and alignment programs, including those that build a multiple sequence alignment (MSA).
• ESTs (Expressed Sequence Tags): partial sequences of cDNA copies of mRNA.
• Forward-backward algorithm: the forward algorithm sums the probabilities of all possible alignments of a sequence with an HMM, working in the forward direction; the backward algorithm does the same but starts from the end of the sequence. Together they give the probability that the alignment passes through a given state at a given position.
• Genetic distance (between sequences): the number of changes needed to convert one sequence into the other. Gaps are not counted.
• Information theory: analysis of the variation in the columns of a PSSM, which represents the variation found in the columns of an MSA. The information content shows how useful a column is for finding other matching sequences. Information is high when a column is biased towards one character and low when a variety of different characters is present.
• Motif: a conserved pattern of AAs found in two or more proteins, normally present near the active site of proteins showing similar biochemical activity.
• Phylogenetic tree: a tree of ancestral relationships based on nucleic acid or protein sequences. Similar sequences sit on adjacent outer branches joined at a common node, while distantly related sequences branch off at additional nodes. Branch length shows the number of changes between adjacent nodes in the tree.
• Profile: a scoring matrix representation of a conserved region in the MSA that allows for gaps in the alignment. Rows hold the scores for matching successive columns of the alignment, while columns hold the substitution scores for the AAs and the gap penalties.
• Rooted tree: a tree in which all sequences descend from a common point (the root) on one of the branches. The path from that point through the tree defines the predicted evolutionary path to each sequence.
• Unrooted tree: a tree representation for a group of related sequences that does not indicate which of the sequences is the ancestor of the others.
Multiple Sequence Alignment
Three programs are used for progressive sequence alignment:
CLUSTALW (Thompson et al., 1994a, 1997): a newer version of CLUSTAL (Higgins and Sharp 1988) in which W stands for weighting, i.e. the weight given to each sequence.
• Perform pairwise alignments of all the sequences.
• Use the alignment scores to produce a phylogenetic tree by the neighbor-joining method.
• Align the sequences sequentially by a dynamic programming algorithm, following the phylogenetic relationships: the most closely related sequences are aligned first.
• Gap penalties are adjusted so that gaps fall preferentially between secondary structural elements.
CLUSTALX (Higgins et al., 1996) is a graphical interface.
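A. Calculation of sequence weights
Figure: guide tree with branch lengths A = 0.2, B = 0.2 and C = 0.5; A and B share an internal branch of length 0.3.
Weighting factors:
A: 0.2 + 0.3/2 = 0.35
B: 0.2 + 0.3/2 = 0.35
C: 0.5
The weights are then scaled so that the largest weight is 1.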
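The weighting rule in the figure can be sketched in a few lines of Python. This is an illustrative sketch, not CLUSTALW's own code: the nested-list tree encoding and the function names are assumptions made for the example. Each sequence receives its own branch length plus an equal share of every branch it has in common with other sequences.

```python
def leaf_count(node):
    """Count the sequences (leaves) below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return 1
    return sum(leaf_count(child) for child, _ in node)


def sequence_weights(node, inherited=0.0, weights=None):
    """Give each leaf its own branch length plus an equal share of every
    branch it shares with other leaves."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        # A branch above k leaves contributes length / k to each of them.
        share = length / leaf_count(child)
        sequence_weights(child, inherited + share, weights)
    return weights


# Hypothetical encoding: an internal node is a list of (child, branch_length).
# This is the tree from the figure: A and B share a branch of length 0.3.
tree = [([("A", 0.2), ("B", 0.2)], 0.3), ("C", 0.5)]

weights = sequence_weights(tree)
largest = max(weights.values())
normalized = {name: w / largest for name, w in weights.items()}

print(weights)      # raw weights for A, B and C
print(normalized)   # scaled so that the largest weight is 1
```

With the branch lengths from the figure, A and B each get 0.35 and C gets 0.5 before scaling.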
Use of sequence weights in CLUSTALW
Column in alignment 1: Sequence A (weight a) has K, Sequence B (weight b) has I
Column in alignment 2: Sequence C (weight c) has L, Sequence D (weight d) has V
Score for matching these two columns in an MSA:
[ a × c × score(K,L) + a × d × score(K,V) + b × c × score(I,L) + b × d × score(I,V) ] / 4
Each weight reflects the distance (branch length) of that sequence in the tree, so more divergent sequences contribute more to the column score.
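The column score above generalizes to any number of sequences per column: average the weighted substitution scores over every cross-pair. The sketch below assumes that reading; the weights and the four substitution scores are illustrative values chosen for the example, not values taken from the slides.

```python
from itertools import product


def column_pair_score(col1, col2, score):
    """Average weighted substitution score between two alignment columns.

    col1, col2: lists of (residue, sequence_weight) pairs.
    score: function returning the substitution score for two residues.
    """
    total = 0.0
    for (r1, w1), (r2, w2) in product(col1, col2):
        total += w1 * w2 * score(r1, r2)
    return total / (len(col1) * len(col2))


# Illustrative substitution scores (assumed for this example only).
SCORES = {("K", "L"): -2, ("K", "V"): -2, ("I", "L"): 2, ("I", "V"): 3}


def lookup(r1, r2):
    return SCORES[(r1, r2)]


# Hypothetical weights: a = 1.0, b = 0.5, c = 1.0, d = 0.5.
column1 = [("K", 1.0), ("I", 0.5)]   # sequences A and B
column2 = [("L", 1.0), ("V", 0.5)]   # sequences C and D

result = column_pair_score(column1, column2, lookup)
print(result)  # (-2 - 1 + 1 + 0.75) / 4 = -0.3125
```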
PILEUP
PILEUP is an MSA program that is part of the GCG (Genetics Computer Group) package.
It uses a method similar to CLUSTALW.
It is less advanced than CLUSTALW, which can use sequence weighting and gap modifications.
PILEUP therefore cannot reduce the dominating effect of closely related sequences or adjust where gaps are placed.
Sequences are aligned pairwise and clustered using UPGMA (Sneath and Sokal 1973), the Unweighted Pair-Group Method using Arithmetic averages.
Interactive refinement
Figure: multiple alignment and UPGMA tree constructed from the bHLH DNA-binding motifs.
UPGMA
• UPGMA was developed for constructing taxonomic phenograms, i.e. trees that reflect the phenotypic similarities between OTUs (Operational Taxonomic Units).
• It can also be used to construct phylogenetic trees, assuming a constant rate of evolution in the different lineages.
• UPGMA uses a sequential clustering algorithm in which local topological relationships are identified on the basis of similarity.
• The clustering is applied stepwise to build the phylogenetic tree.
• First, the two OTUs that are most similar to each other are identified and then treated as a new single OTU, called a composite OTU.
• Subsequently the pair of OTUs with the highest similarity is identified among the new group of OTUs, and so on, until only two OTUs are left.
Phylogenetic tree of 6 OTUs
First Cycle: the evolutionary distances (distance matrix) are as follows:
      B   C   D   E   F
A     2   4   6   6   8
B     -   4   6   6   8
C     -   -   6   6   8
D     -   -   -   4   8
E     -   -   -   -   8
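Pair the two OTUs with the smallest distance first, i.e. A and B (distance 2).
The branching point lies at a distance of 2/2 = 1 from each of them.
Figure: A and B joined at a node, each on a branch of length 1.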
Calculation of new distance matrix
Dist (A,B),C = (dist AC + dist BC) / 2 = 4
Dist (A,B),D = (dist AD + dist BD) / 2 = 6
Dist (A,B),E = (dist AE + dist BE) / 2 = 6
Dist (A,B),F = (dist AF + dist BF) / 2 = 8
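Calculation of new distance matrix for Second Cycle
      C   D   E   F
A,B   4   6   6   8
C     -   6   6   8
D     -   -   4   8
E     -   -   -   8
D and E are paired next (distance 4); their branching point lies at 4/2 = 2.
Figure: D and E joined at a node, each on a branch of length 2.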
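Calculation of new distance matrix for Third Cycle
      C   D,E   F
A,B   4   6     8
C     -   6     8
D,E   -   -     8
(A,B) and C are paired next (distance 4); their branching point lies at 4/2 = 2.
Figure: A and B on branches of length 1, joined to C (branch length 2) through an internal branch of length 1.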
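Calculation of new distance matrix for Fourth Cycle
       D,E   F
AB,C   6     8
D,E    -     8
(AB,C) and (D,E) are paired next (distance 6); their branching point lies at 6/2 = 3.
Figure: tree of A, B, C, D and E, with the (AB,C) and (D,E) clusters each joined to the new node by an internal branch of length 1.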
Calculation of new distance matrix for Fifth Cycle
         F
ABC,DE   8
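The five cycles can be reproduced with a short Python sketch. It is a minimal illustration, not the code of any particular package: the data layout and function name are assumptions. When a composite OTU is formed, its distance to every other cluster is the average over all member OTUs (weighted by cluster size), which for this matrix gives the same numbers as the simple averages shown in the slides.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    dist maps frozenset({x, y}) to the distance between OTUs x and y.
    Returns the merges as (cluster_a, cluster_b, branch_point_height).
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of OTUs
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        # Distance from the composite OTU to each remaining cluster:
        # arithmetic average over all members of the merged clusters.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
        merges.append((a, b, height))
    return merges


# First-cycle distance matrix from the slides (upper triangle).
names = ["A", "B", "C", "D", "E", "F"]
rows = {"A": [2, 4, 6, 6, 8], "B": [4, 6, 6, 8], "C": [6, 6, 8],
        "D": [4, 8], "E": [8]}
dist = {}
for i, x in enumerate(names[:-1]):
    for y, value in zip(names[i + 1:], rows[x]):
        dist[frozenset((x, y))] = value

merges = upgma(names, dist)
heights = [h for _, _, h in merges]
for a, b, h in merges:
    print(f"join {a} and {b} at height {h}")
```

D,E and (A,B),C are tied at distance 4, so this sketch may perform those two merges in the opposite order to the slides; the branch-point heights (1, 2, 2, 3 and 4) are the same either way.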
UPGMA:
The clustering itself leads essentially to an unrooted tree.
However, UPGMA assumes equal rates of mutation along all the branches.
Under that assumption the root must be equidistant from all OTUs.
Hence the mid-point rooting method is applied.
The root of the entire tree is positioned at dist (ABCDE),F / 2 = 8/2 = 4.
Pitfalls in UPGMA:
• This clustering method is very sensitive to unequal evolutionary rates.
• That is, one of the OTUs may have incorporated more mutations over time than the others,
• giving a tree that has the wrong topology.
• Clustering works only if the data are ultrametric, i.e. if they satisfy the 'three-point condition'.
T-COFFEE (Tree based Consistency based Objective Function For alignmEnt Evaluation)
• Advanced progressive alignment
• It is used to align a set of sequences gathered using programs such as BLAST, FASTA, etc.
• Combines the results obtained by several alignment methods
• The program starts with both global and local alignments
• www.tcoffee.org
• Color codes: an indicator of the reliability of the alignment.
• Red regions are the most consistent and therefore the most likely to be correctly aligned.
• Blue regions are the least reliable.
• T-Coffee is used to identify faulty gene expression (Cedric Notredame and Chantal Abergel)
• Can be used to make your own library
• Measures the consistency of a multiple sequence alignment
• Uses the CORE measure to assess local alignment quality
Figure: identifying correct blocks with CORE measures
Figure: identifying frameshifts and start codons
HMM-Hidden Markov Models
• Statistical models that consider all possible combinations of matches, mismatches and gaps to generate an alignment of a set of sequences.
• These models are used for both protein sequences and DNA sequences, such as RNA splice junctions.
• A model that takes into account the lengths of the sequences and their insertions and deletions is first produced and then initialized with prior information, i.e. a guess of the expected variation at each position of the multiple sequence alignment.
• Previously used for speech recognition
• Used to analyze sequence composition and patterns
• Used to locate genes by predicting open reading frames (ORFs)
• Used to produce protein structure predictions
Advantages and Disadvantages of HMMs
• Advantages:
– Better than global and local alignment methods, including profiles and scoring matrices.
– Well grounded in probability theory.
– No sequence ordering is required.
– Guesses of insertion/deletion penalties are not needed.
– Experimentally derived information can be used.
– Can naturally accommodate variable-length models of regions of sequence.
• Disadvantages:
– At least 20 sequences are required to capture the evolutionary history.
• The software is available at
– www.cse.ucsc.edu/research/compbio/sam.html
– www.hmmer.wustl.edu
HMM representation for Gene
• A Bayesian statistics framework is used with HMMs because it converts the likelihood of the data into a posterior probability.
• The posterior probability makes it possible to integrate prior knowledge about the way in which the protein evolved.
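As a small numerical illustration of that conversion, the sketch below applies Bayes' rule to two competing models. The model names, priors and likelihoods are hypothetical numbers chosen for the example; they do not come from the slides.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: posterior = prior * likelihood / evidence."""
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    evidence = sum(joint.values())  # total probability of the data
    return {m: joint[m] / evidence for m in joint}


# Hypothetical prior belief that a sequence belongs to the protein family.
priors = {"family_model": 0.1, "background_model": 0.9}
# Hypothetical likelihood of the observed sequence under each model.
likelihoods = {"family_model": 0.02, "background_model": 0.001}

post = posterior(priors, likelihoods)
print(post)
```

Even with a low prior, the family model ends up with the larger posterior here because the data are twenty times more likely under it.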
HMM representation for Protein
Figure legend:
Arrows: transition probabilities
M (Match): consensus AAs
I (Insert): insertion of residues
D (Delete): skipping the consensus position
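The forward algorithm listed under Terms can be sketched for a small HMM. This is a simplified illustration under stated assumptions: only two emitting states (match and insert) are used, the silent delete state of a full profile HMM is left out, the alphabet is reduced to two symbols, and all probabilities are hypothetical.

```python
from itertools import product


def forward(seq, states, start, trans, emit):
    """Probability of seq summed over all state paths (forward algorithm)."""
    # Initialisation: start in each state and emit the first symbol.
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    # Recursion: extend every partial path by one symbol at a time.
    for symbol in seq[1:]:
        f = {s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    # Termination: sum over the state in which the path ends.
    return sum(f.values())


# Hypothetical two-state model: M prefers "A", I emits both symbols equally.
states = ("M", "I")
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.5, "I": 0.5}}
emit = {"M": {"A": 0.9, "G": 0.1}, "I": {"A": 0.5, "G": 0.5}}

p = forward("AAG", states, start, trans, emit)
print(p)

# Sanity check: the probabilities of all length-3 sequences sum to 1.
total = sum(forward("".join(s), states, start, trans, emit)
            for s in product("AG", repeat=3))
print(total)
```

The backward algorithm runs the same recursion from the end of the sequence; combining the two gives the probability of each state at each position.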
Theoretical contributions from bioinformatics
• Use of small data sets
• Novel decoding methods, e.g. posterior decoding
• General extensions of techniques
• Open areas for research in HMMs:
• Integration of structural information into profile HMMs
• Model architecture: use the simplest architecture that fits the data
• Biological mechanism: HMMs find genes in the context of genomic DNA, which is not what the biological machinery that processes RNA works with
PSSM-Position Specific Scoring Matrices
http://www.ncbi.nlm.nih.gov/Class/Structure/pssm/pssm_viewer.cgi
http://www.sbg.bio.ic.ac.uk/3dpssm/
• Used to search a sequence to obtain the most probable location or locations of a motif.
• Used to search an entire database to identify additional sequences that have the same motif.
• Built by a simple logarithmic transformation of a matrix giving the frequency of AAs in the motif.
• A PSSM can be created using PSI-BLAST or taken from the NCBI CDD database.
• CD records can be obtained from Entrez Conserved Domains by using RPS-BLAST, known as CD-Search.
• Positive scores show that a substitution occurs more frequently in the alignment than expected, while negative scores indicate that it occurs less frequently. Large positive scores indicate critical functional residues.
• In contrast, position-independent matrices, e.g. PAM and BLOSUM, give a Tyr-Trp substitution the same score regardless of position.
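The logarithmic transformation and the sequence search can be sketched as follows. This is a minimal illustration, not the PSI-BLAST procedure: the motif instances are hypothetical, the background frequencies are assumed to be uniform, a pseudocount of 1 is added to avoid taking the logarithm of zero, and scores are left as real numbers rather than scaled to integers.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(motifs, pseudocount=1.0):
    """Log-odds PSSM from equal-length motif instances.

    Assumes a uniform background frequency of 1/20 for every AA.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(motifs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*motifs):  # one tuple of residues per motif position
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; return (best_start, best_score)."""
    width = len(pssm)
    scores = [sum(pssm[i][sequence[start + i]] for i in range(width))
              for start in range(len(sequence) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical aligned occurrences of a three-residue motif.
motifs = ["HRK", "HRR", "HKK", "HRK"]
pssm = build_pssm(motifs)

best_start, best_score = scan("AAGHRKLLV", pssm)
print(best_start, round(best_score, 2))
```

Residues that are frequent at a position get positive scores and residues never seen there get negative ones, so the window that matches the motif scores highest.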
Sequence logo
• Represents the amount of information in each column of a motif.
• The horizontal scale represents sequential positions in the motif.
• The higher the column, the more useful that position is for finding matches in sequences.
• Each column shows the symbols of the AAs found at the corresponding position of the motif, with the height of each AA proportional to the frequency of that amino acid in the column.
• AAs are shown in decreasing order of abundance from the top of the column.
• The relative frequency of each AA in each column of the motif is given by the size of the letters in that column.
• The total height of the column provides a measure of how useful that column is for reducing the level of uncertainty in a sequence matching experiment.
• If the data set is small, the column frequencies are unreliable unless the motif has almost identical amino acids in each column.
• It is desirable to improve the estimates of AA frequencies by adding user-defined extra AA counts, called pseudocounts.
• Adding pseudocounts gives an improved estimate of the probability Pca that amino acid ‘a’ is in column ‘c’ in all occurrences of the block.
• Without pseudocounts, Pca is represented by fca, the frequency of counts in the data:
– fca = nca / Nc
• The Bayesian prediction of Pca is
– Pca = (nca + bca) / (Nc + Bc), where nca is the real count and bca the pseudocount of amino acid ‘a’ in column ‘c’, and Nc and Bc are the total numbers of real counts and pseudocounts in the column, respectively.
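The pseudocount estimate, together with the column height used in a sequence logo, can be sketched as below. The pseudocount values (0.05 per amino acid, so Bc = 1) and the example column are hypothetical, and the information content is computed as log2(20) minus the column entropy, a common definition that is assumed here and ignores small-sample corrections.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocounts):
    """Pca = (nca + bca) / (Nc + Bc) for every amino acid a."""
    counts = Counter(column)            # nca: real counts in the column
    n_c = len(column)                   # Nc: total real counts
    b_c = sum(pseudocounts.values())    # Bc: total pseudocounts
    return {aa: (counts[aa] + b) / (n_c + b_c)
            for aa, b in pseudocounts.items()}


def information_content(probs):
    """Column information in bits: log2(alphabet size) minus the entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(len(probs)) - entropy


# Hypothetical pseudocounts: 0.05 for each of the 20 AAs, so Bc = 1.
pseudocounts = {aa: 0.05 for aa in AMINO_ACIDS}

conserved = column_probabilities("LLLLV", pseudocounts)
variable = column_probabilities("LKAGV", pseudocounts)

print(round(conserved["L"], 3))   # (4 + 0.05) / (5 + 1) = 0.675
print(information_content(conserved) > information_content(variable))
```

The conserved column carries more information than the variable one, which is why it would be drawn taller in the logo.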