Terms
• Blocks: ungapped patterns of amino acids (AAs) that are present in related proteins.
• Contig: an assembled set of overlapping DNA sequence fragments.
• Domain: a discrete portion of a given protein sequence, typically one that folds or functions as a unit.
• E value (expect value): a statistical parameter giving the number of matches with at least the observed score that are expected by chance; used when assessing sequence matches for a multiple sequence alignment (MSA).
• ESTs (Expressed Sequence Tags): partial sequences of cDNA copies of mRNA.
• Forward-backward algorithm: the forward part sums the probabilities of all possible alignments of a sequence with an HMM, working in the forward direction; the backward part does the same but starts from the end of the sequence. Together they provide the probability of the sequence given the model.
• Genetic distance (between sequences): the number of changes needed to convert one sequence into the other. Gaps are not counted.
• Information theory: analysis of the variation in the columns of a PSSM, which represents the variation found in the columns of an MSA. The information content of a column indicates how useful it is for finding further matching sequences: it is high when a matrix column is biased towards one character and low when a variety of different characters is present.
• Motif: a conserved pattern of AAs found in two or more proteins, normally near the active site of proteins that show similar biochemical activity.
• Phylogenetic tree: a tree of ancestor relationships based on nucleic acid or protein sequences. Similar sequences lie on adjacent outer branches that join at a common node, while distantly related sequences are joined through additional nodes. Branch length shows the amount of change between adjacent nodes in the tree.
• Profile: a scoring-matrix representation of a conserved region of the MSA that allows for gaps in the alignment. Each row holds the scores for one successive column of the alignment; the columns hold the substitution scores for each AA and the gap penalties.
• Rooted tree: all sequences are descended from a common point on one of the tree branches. The path from that point through the tree defines the predicted evolutionary path to each sequence.
• Unrooted tree: a tree representation of a group of related sequences that does not indicate which of the sequences is the ancestor of the others.

Multiple Sequence Alignment
Three programs are used for progressive sequence alignment:

CLUSTALW (Thompson et al., 1994a, 1997)
• New version of CLUSTAL (Higgins and Sharp 1988); the W stands for Weighting, i.e. the weight given to each sequence.
• Performs pairwise alignments of all the sequences.
• Uses the alignment scores to produce a phylogenetic tree by the neighbor-joining method.
• Aligns the sequences sequentially by a dynamic programming algorithm, guided by the phylogenetic relationships; the most closely related sequences are aligned first.
• Gaps are placed so that they fall preferentially between secondary structural elements.
• CLUSTALX (Higgins et al., 1996) is the graphical interface.

A. Calculation of sequence weights
[Figure: tree with branch lengths A = 0.2, B = 0.2, C = 0.5; A and B share an internal branch of length 0.3]
Sequence   Weighting factor
A          0.2 + 0.3/2 = 0.35
B          0.2 + 0.3/2 = 0.35
C          0.5
The weights are then scaled so that the largest weight is 1.

Use of the weights in CLUSTALW
Column in alignment 1:
  Sequence A (weight a)   ...K...
  Sequence B (weight b)   ...I...
Column in alignment 2:
  Sequence C (weight c)   ...L...
  Sequence D (weight d)   ...V...
Score for matching these two columns in an MSA:
[ a x c x score(K,L) + a x d x score(K,V) + b x c x score(I,L) + b x d x score(I,V) ] / 4
Each weight is proportional to the distance of the sequence from the root of the tree, so more divergent sequences receive larger weights.

PILEUP
• An MSA program that is part of the GCG (Genetics Computer Group) package.
• Uses a method similar to that of CLUSTALW.
• Less advanced than CLUSTALW: it lacks sequence weighting and gap modifications, so it cannot reduce the dominating effect of closely related sequences or give emphasis to where gaps are placed.
• Sequences are aligned pairwise and the scores are clustered by UPGMA (Sneath and Sokal 1973), the Unweighted Pair-Group Method using Arithmetic averages.

Interactive refinement
[Figure: multiple alignment and UPGMA tree constructed from the bHLH DNA-binding motifs]

UPGMA
• UPGMA was developed for constructing taxonomic phenograms, i.e.
trees that reflect the phenotypic similarities between OTUs (Operational Taxonomic Units).
• It can also be used to construct phylogenetic trees, provided the rate of evolution is assumed to be constant across the different lineages.
• UPGMA uses a sequential clustering algorithm in which local topological relationships are identified on the basis of similarity.
• The tree is built in a stepwise manner.
• First, the two OTUs that are most similar to each other are identified and then treated as a new single OTU, called a composite OTU.
• Next, the pair with the highest similarity is identified among the new group of OTUs, and so on, until only two OTUs are left.

Phylogenetic tree of 6 OTUs
The evolutionary distances (distance matrix) are as follows.

First Cycle
      B   C   D   E   F
A     2   4   6   6   8
B         4   6   6   8
C             6   6   8
D                 4   8
E                     8
Pair the two OTUs with the smallest distance, here A and B at a distance of 2. The first branching point lies at a distance of 2/2 = 1.
[Tree so far: A and B each joined to their common node by a branch of length 1]

Calculation of the new distance matrix
Dist (A,B),C = (dist AC + dist BC) / 2 = 4
Dist (A,B),D = (dist AD + dist BD) / 2 = 6
Dist (A,B),E = (dist AE + dist BE) / 2 = 6
Dist (A,B),F = (dist AF + dist BF) / 2 = 8

New distance matrix for the Second Cycle
      C   D   E   F
A,B   4   6   6   8
C         6   6   8
D             4   8
E                 8
D and E are joined at a distance of 4; the branching point lies at 4/2 = 2.
[Tree so far: D and E each joined to their common node by a branch of length 2]

New distance matrix for the Third Cycle
      C   D,E   F
A,B   4   6     8
C         6     8
D,E             8
(A,B) and C are joined at a distance of 4; the branching point lies at 4/2 = 2.
[Tree so far: A and B on branches of length 1, their node joined to the new node by a branch of length 1, and C joined by a branch of length 2]

New distance matrix for the Fourth Cycle
       D,E   F
AB,C   6     8
D,E          8
(AB,C) and (D,E) are joined at a distance of 6; the branching point lies at 6/2 = 3.
[Tree so far: the ABC node and the DE node are each joined to the new node by a branch of length 1]

New distance matrix for the Fifth Cycle
         F
ABC,DE   8

UPGMA:
• Leads essentially to an unrooted tree.
• Assumes equal rates of mutation along all the branches, so the root must be equidistant from all OTUs. Hence the mid-point rooting method is applied.
• The root of the entire tree is positioned at dist (ABCDE),F / 2 = 4.

Pitfalls in UPGMA:
• This clustering method is very sensitive to unequal evolutionary rates. If one OTU has accumulated more mutations over time than another, the resulting tree may have the wrong topology.
• Clustering works only if the data are ultrametric, i.e. if they satisfy the 'three-point condition'.

T-COFFEE (Tree-based Consistency Objective Function For alignmEnt Evaluation)
• An advanced progressive alignment method.
• It is used to align a set of sequences gathered with programs such as BLAST, FASTA, etc.
• Combines the results obtained by several alignment methods.
• The program starts with both global and local alignments.
• www.tcoffee.org
• Color codes indicate the reliability of the alignment.
• Red portions are the more consistent and therefore the more likely to be correctly aligned.
• Blue portions are the less trustworthy.
• T-COFFEE is used to identify faulty gene expression (Cedric Notredame and Chantal Abergel).
• Can be used to make your own library.
• Measures the consistency of a multiple sequence alignment.
• Uses the core measure to assess local alignment quality:
– Identifying correct blocks with core measures
– Identifying frameshifts and start codons

HMM: Hidden Markov Models
• Statistical models that consider all possible combinations of matches, mismatches and gaps to generate an alignment of a set of sequences.
• These models are used both for protein sequences and for DNA sequences, e.g. RNA splice junctions.
• A model that takes into account the lengths of the sequences and allows for insertions and deletions is first produced and initialized with prior information, i.e. a guess of the expected variation at each position of the multiple sequence alignment.
• Previously used for speech recognition.
• Used to analyze sequence composition and patterns.
• Used to locate genes by predicting open reading frames (ORFs).
• Used to produce protein structure predictions.

Advantages and Disadvantages of HMMs
• Advantages:
– Can perform better than other global and local alignment methods, including profiles and scoring matrices.
– The method is well grounded in probability theory.
– No sequence ordering is required.
– Guesses of insertion/deletion penalties are not needed.
– Experimentally derived information can be used.
– Can naturally accommodate variable-length models of regions of sequence.
• Disadvantages:
– At least 20 sequences are required to accommodate the evolutionary history.
• The software is available at
– www.cse.ucsc.edu/research/ompbio/sam.html
– www.hmmer.wustl.edu

HMM representation for Gene
• A Bayesian statistics framework is used with HMMs because it converts the likelihood of the data into a posterior probability.
• The posterior probability makes it possible to integrate prior knowledge about the way in which the protein evolved.

HMM representation for Protein
Transition probabilities connect three kinds of state:
M – Match: consensus AAs
I – Insert: insertion of residues
D – Delete: skipping the consensus position

Theoretical contributions from bioinformatics
• Small data set usage
• Novel decoding methods that use posterior decoding
• General extensions of techniques
• Open areas for research in HMMs:
– Integration of structural information into profile HMMs
– Model architecture: use a simple one that fits the data
– Biological mechanism: HMMs find genes in the context of genomic DNA, a context that is not handled by the biological machinery, which processes RNA

PSSM: Position-Specific Scoring Matrices
http://www.ncbi.nlm.nih.gov/Class/Structure/pssm/pssm_viewer.cgi
http://www.sbg.bio.ic.ac.uk/3dpssm/
• Used to search a sequence for the most probable location or locations of a motif.
• Used to search an entire database to identify additional sequences that have the same motif.
• Built by a simple logarithmic transformation of a matrix giving the frequency of each AA at each position of the motif.
• A PSSM can be created using PSI-BLAST or taken from the NCBI CDD database.
• CD records can be obtained from Entrez Conserved Domains by using RPS-BLAST, also known as CD-Search.
• Positive scores show that a substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently. Large positive scores often indicate critical functional residues.
• Position-independent matrices, e.g.
PAM and BLOSUM, in which a Tyr-Trp substitution receives the same score regardless of position.

Sequence logo
• Represents the amount of information in each column of a motif.
• The horizontal scale represents sequential positions in the motif.
• The higher the column, the more useful that position is for finding matches in sequences.
• Each column shows the symbols of the AAs found at the corresponding position of the motif, with the height of each AA proportional to the frequency of that amino acid in the column.
• AAs are shown in decreasing order of abundance from the top of the column.
• The relative frequency of each AA in each column of the motif is given by the size of the letters in that column.
• The total height of the column provides a measure of how useful that column is for reducing the level of uncertainty in a sequence matching experiment.
• If the data set is small, then unless the motif has almost identical amino acids in each column, the observed column frequencies may not be representative of other occurrences of the motif.
• In that case it is desirable to improve the estimates of the AA frequencies by adding user-defined extra AA counts, called pseudocounts.
• Adding pseudocounts gives an improved estimate of the probability P_ca, where 'a' is the AA and 'c' is the column, over all occurrences of the blocks.
• Without pseudocounts, P_ca is estimated by f_ca, the frequency of counts in the data:
– f_ca = n_ca / N_c
• The Bayesian prediction of P_ca is:
– P_ca = (n_ca + b_ca) / (N_c + B_c)
– n_ca: real counts; b_ca: pseudocounts; a: amino acid in the column; N_c and B_c: total number of real counts and pseudocounts in the column, respectively.
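
The pseudocount estimate above can be tried out on a toy column. The following is a minimal sketch, not a definitive implementation: the function name `column_probabilities`, the three-letter alphabet and the choice of the same pseudocount for every AA (b_ca = 1) are assumptions made for illustration only; real programs usually derive b_ca from background frequencies or substitution matrices.

```python
from collections import Counter

def column_probabilities(column, alphabet, pseudocount=1.0):
    """Estimate P_ca = (n_ca + b_ca) / (N_c + B_c) for one motif column.

    column      -- string of residues observed in this column (real counts)
    alphabet    -- residues to report (hypothetical toy alphabet here)
    pseudocount -- b_ca, assumed identical for every residue (an assumption)
    """
    counts = Counter(column)                             # n_ca for each residue
    n_total = sum(counts.get(a, 0) for a in alphabet)    # N_c
    b_total = pseudocount * len(alphabet)                # B_c
    return {
        a: (counts.get(a, 0) + pseudocount) / (n_total + b_total)
        for a in alphabet
    }

# Toy column with three K and one R; H is never observed.
smoothed = column_probabilities("KKKR", "KRH", pseudocount=1.0)
raw = column_probabilities("KKKR", "KRH", pseudocount=0.0)  # reduces to f_ca = n_ca / N_c

for a in "KRH":
    print(a, round(raw[a], 3), round(smoothed[a], 3))
```

With pseudocounts the unobserved residue H no longer has probability zero, so the logarithm taken when building a PSSM stays finite, while the estimates still sum to 1.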