DNA Properties and Genetic Sequence Alignment CSE, Marmara University mimoza.marmara.edu.tr/~m.sakalli/cse546 Oct/19/09 Computational Molecular Biology Bioinformatics Genomics Proteomics Functional genomics Structural bioinformatics Sequence Alignment and Why Global Alignment Local Alignment Suppose a cloned gene If it is already in databases. Database Accession, Annotation (summary of structure), expression profile? Mutants? Its protein characteristics? -Sub-localization -Soluble? -3D fold Is there conserved regions?-Alignments?-Domains? Is there similar sequences?-% identity?-Family member? Evolutionary relationship?-Phylogenetic tree Scoring Matrices Alignment with Affine Gap Penalties Applying algorithms to analyze genomics data Applying Manhattan Tourist Problem to sequence comparison Local vs. Global Alignment Global Alignment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C Local Alignment—better alignment to find conserved tccCAGTTATGTCAGgggacacgagcatgcagagac segment |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc Local alignment Global alignment a “mini” Global Alignment to get Local Some genes only have small conserved regions between species of organisms Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence. Find long ORFs, the longest ORFs first and putting together a set with minimal overlaps, identify the furthest upstream start codon. A new found ORF might not contain a real gene. Compare it with a gene known from other species. Different DNA sequences will give identical proteins (remember from the codon table). So there is large amount of redundancy in DNA sequence. Conservation of a sequence between species strongly suggests that the sequence has a function that is being conserved by natural selection. A protein being functional is more likely being conserved in evolutionary process than DNA. The organism’s survival depends on the protein being functional. The protein 3-dimensional structure is even more conserved, because it is more closely related to enzyme activity than the amino acid sequence is. Research subjects: To build 3-D structure from a DNA sequence Sequence Comparison Comparing ORF sequence to a database of known protein sequences from many species. BLAST is the standard (BLAST = Basic Local Alignment Search Tool) BLAST is based on the concept that if you compare the same (that is, homologous) protein from many different species. - Some amino acids readily substitute for each other and some others almost are unique and never will substitute. A substitution matrix, giving a score for each amino acid position in the proteins being compared These weights are given based on biological evidence. Alignments can be thought of as mutations in the sequence. Some of these mutations have little effect on the organism’s function, therefore some penalties, δ(vi , wj), will be less harsh than others. Terminology: query sequence: sequence entered. Sequences matching are subject sequences. Gene B. meg. It is 174 amino acids lon, written in “fasta” format: Starts with > and immediately followed by an identifier (ORF00135), and then some comments. And then follows the sequence. Scoring matrices: Scoring matrices are created based on biological evidences. Alignments can be thought of as two sequences that differ due to mutations in the sequence. Some of these mutations have little effect on the organism’s function, therefore some penalties, δ(vi , wj), will be less harsh than others. Different amino acids might have a positive score. Ie due to having positively charged. Score S is calculated by summing the scores assigned for matches, mismatches and gaps (creation/extension scores). The scores are given by the specified substitution matrix. PAM (Percent Accepted Mutation): for evolutionary studies. For example in PAM1, 1 accepted point mutation per 100 amino acids is required. BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding common motifs. For example in BLOSUM62, the alignment is created using sequences sharing no more than 62% identity. How the matrices were created: Very similar sequences were aligned. From these alignments, the frequency of substitution between each pair of amino acids was calculated and then PAM1 was built. After normalizing to log-odds format, the full series of PAM matrices can be calculated by multiplying the PAM1 matrix by itself. >ORF00135 |chromosome 538197-538721 revcomp MKAKLIQYVYDAECRLFKSVNQH FDRKHLNRFLRLLTHAGGATFTI VIACLLLFLYPSSVAYACAFSLA VSHIPVAIAKKLYPRKRPYIQLK HTKVLENPLKDHSFPSGHTTAIF SLVTPLMIVYPAFAAVLLPLAVM VGISRIYLGLHYPTDVMVGLILG IFSGAVALNIFLT - Some matrices reflect similarity: good for database searching - Some reflect distance: good for phylogenies - Log-odds matrices, a normalization method for matrix values: Sij = log (qij/(pi pj))=log (qij) – log (pi) – log (pj) Sij is the probability that two residues, i and j, are aligned by evolutionary descent and by chance. qij are the frequencies that i and j are observed to align in sequences known to be related. pi and pj are their frequencies of occurrence in the set of sequences. The most widely used local similarity algorithms are: Smith-Waterman (http://www.ebi.ac.uk/MPsrch/) Basic Local Alignment Search and Fast Alignment, which are based on k-tuple algorithms… Speedwise: BLAST > FASTA > Smith-Waterman (It is VERY SLOW and uses a lots of computational power Sensitivity/statistics: FASTA is more sensitive to variations, misses less homologues Smith-Waterman is even more sensitive. BLAST calculates probabilities FASTA more accurate for DNA-DNA search then BLAST BLAST and FASTA variants Comparison between, FASTA: a DNA query to DNA db, or a protein query to protein db FASTX: a translated DNA query to a protein database TFASTA: a protein query to a translated DNA database BLASTN: a DNA query to DNA database. BLASTP: a protein query to protein database. BLASTX: 6-frame translations of DNA query to protein database. TBLASTN: a protein query to the 6-frame translations of a DNA database. TBLASTX: 6-frame translations of DNA query to the 6-frame translations of a DNA database. PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching BLAST results Detailed BLAST results E value: is the expectation value or probability to find by chance hits similar to your sequence. The lower the E, the more significant the score. Mostly genes are named with the function of their protein.at some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc. Enzymes (end in –ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more! Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: conserved hypothetical genes Every new genome has some genes that are unique: no matching BLAST hits in the database. Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don’t know call them hypothetical genes “putative” means that we think we know the gene’s function but we aren’t sure. Putative should be followed by the function name. (Basic Similarity) Homology Search, The Knuth-Morris-Pratt Algorithm (exact match) All similarity searching methods rely on the concepts of alignment and a distance measurements between. Associated with a similarity score calculated from a distance matrix for example the number of DNA base that are different between. Exact String Matching Naive brute force algorithm searching for pattern p in text t, lengths m=|p| and n=|t|, then the worst case complexity Θ(nm) . Sliding the pattern across from left to right, one step at a time, but it is possible to shift more than one, given a certain length of the pattern sought is detected in the text. Just imagine it.. The Knuth-Morris-Pratt Algorithm - Complexity The algorithm of KMP algorithm takes into account the information gained during previous symbol comparisons, with prefix and suffix definitions. Never re-compares a text symbol that has matched a pattern symbol. The complexity of the searching phase is in O(n). Its preprocessing stage (the pattern search) has a complexity of O(m). m<=n. Therefore overall complexity is in O(n). The Knuth-Morris-Pratt Algorithm - Definitions Definition: Let A be an alphabet, and x be a string of length k, x=x_0, x_1, …, x_(k) over A. A prefix of x, is a substring u, u=u_0, u_1, …,u_b, where b<k, 0<={b,k}, A suffix of x, is a substring u, u=u_{k-b}, u_{k-b+1}, …, u_{k}, where 0<b<=k. Called proper prefix or suffix of u, if b<k. A border of x is a substring which appears as a proper suffix and a proper prefix of x and its length is b. Example x=abacab. Proper Prefixes: є, a, ab, aba, abac, abaca. Proper Suffixes: є, b, ab, cab, acab, bacab. Borders: є, ab, with lengths of 0 and 2. The empty string є is always a border of x, but itself has no border. 0123456789... abcabcabd abcabd abcabd Matching prefix p=abcab, length 5, mismatch occurs at position 5 between c and b, the widest (border length) - of the prefix of p matching to suffix of p -, b=2, then the shift distance is |p|-|b|=5-2. Theorem: Let r, s be borders of a string x, where |r| < |s|. Then r is a border of s. Proof. The Knuth-Morris-Pratt Algorithm – Preprocessing generating a look-up table Definition: Let x be a string and a A a symbol. A border r of x can be extended by a, if ra (r appended a) is a border of xa. j: 0123456 p[j]: ababaa b[j]: -1001231 void kmpPreprocess() { int i=0, j=-1; b[i] = j; while (i<m) { while (j>=0 && p[i]!=p[j]) j=b[j]; i++; j++; b[i] = j; } } //O(m) length of the pattern //initial j=-1, i=0; j<i.. //if mismatch, reduce j, max reduction until j<0 //if increase with.. The Knuth-Morris-Pratt Algorithm – Searching void kmpSearch() { int i=0, j=0; while (i<n) //Text length { while (j>=0 && t[i]!=p[j]) j=b[j]; i++; j++; if (j==m) //m pattern length { report(i-j); j=b[j]; } } } Example: 0123 4567 abab baba abab ac ab abac abab ab //if mismatch, max reduction to –j; //max j can be m, hits to a match 89... a ac ab ab //border length b[j] lookup ac abac Max number of comparisons 2n. j: 0123456 p[j]: ababaa b[j]: -1001231 Boyer-Moore algorithm Starts comparison from the rightmost, and if the rightmost does not match and if not occurs in the pattern at all, then the pattern can be shifted by m. The Boyer-Moore algorithm uses two different heuristics for determining the maximum possible shift distance in case of a mismatch: the "bad character" and the "good suffix" heuristics. Both heuristics can lead to a shift distance of m. 0 1 2 3 4 5 6 7 8 9 ... 0 1 2 3 4 5 6 7 8 9 ... abbadabacba aabababacba babac abbab babac a b b a b //borders matching Occurrence Function. occ(p, a) = max{ j | pj = a}, occ(text, x) = 2, occ(text, t) = max{0, 3}=3. The pattern is shifted by the longest of the two distances given by the bad chrtr and the good suffix heuristics. Left: Bad suffix character. Example of bad pattern heuristics (a special case of good suffix heuristics). Needs preprocessing, borders.. Requires only O(n/m) comparisons, if always the first symbol mismatch occurs. The preprocessing for the good suffix heuristics is rather difficult to understand and to implement. Therefore, modified versions of the BM algorithm in which the good suffix heuristics are avoided. The argument is that the bad character heuristics would avoid comparisons, while good suffix heuristics would not avoid. Bad character heuristics, - the Horspool algorithm or the Sunday algorithm suit better. Bad-character mismatch of BM, versus Horspool Algorithm, and Sunday’s Algorithm 0 1 2 3 4 5 6 7 8 9 ... 0 1 2 3 4 5 6 7 8 9 ... 0 1 2 3 4 5 6 7 8 9 ... abcabdbacba abcabdbacba abcabdbacba bcbab bcbab bcbab bcbab bcbab bcbab the rightmost position of a in p0 ... pm-2, or -1, occ(text, x) = 2, occ(text, t) = 0, occ(next, t) = -1. Occurrence at the leftmost (last bit) is not taken into count, best case performance O(n/m) On the Sunday’s shifts left of the d, since d does not occur in the pattern, and the comparison can be depend on the symbol probabilities, if known, then the least probable symbol in the pattern is compared first, hoping that it does not match, for the pattern be shifted. Skip Search Algorithm searches least likely pattern!!!.. Minimize the false positive matches. Information theoretic approach: Repetitive subsequences (ie poly AT) has low information content, random.. Minimum Message Length, (MML).. uses HMM (or PFSM).. Next week: Aligning Two Strings www.bioalgorithms.info\Winfried Just Represents each row and each column with a number and a symbol of the sequence present up to a given position. For example the sequences are represented as: Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 4 T T 5 5 T _ 5 6 A A 6 7 T _ 6 7 _ C 7 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)