Learning Objectives for Sequence Analysis Lecture.
by Dr. Ilya Ioshikhes, Department of Biomedical Informatics, 3017 Graves Hall, Tel. 292-6514, E-mail: ioschikhes-1@medctr.osu.edu

Building blocks of DNA and RNA.
1. What are the building blocks of DNA?
2. What nucleotides are involved in DNA and RNA structure?
3. How is the information transferred from DNA to proteins?

Building blocks of proteins.
4. What are the building blocks of proteins?
5. What is the primary factor determining a protein’s shape and structure?

DNA and proteins (polypeptides) as sequences.
6. What components are needed to build a sequence?
7. How are DNA and proteins considered in sequence analysis?

Basic approaches in sequence analysis.
8. What are the basic approaches to comparing two sequences?
9. What are global and local alignment, and what is the difference between them?
10. In which cases do we need to compare entire sequences, and in which cases only their segments?
11. Which kind of alignment is usually more useful for gene detection?

Algorithms and software for pair-wise sequence comparisons.
12. What are the most popular algorithms you know?
13. How are sequences with mismatches compared?
14. What algorithm would you use to find regions conserved between two proteins?

Multiple sequence alignment (MSA) and molecular evolution.
15. What is the difference between MSA and pair-wise sequence alignment?
16. Explain the relationship between molecular evolution and MSA.
17. How are sequence mutations represented in MSA?
18. What are the basic types of sequence mutations?
19. How is pair-wise sequence alignment used in MSA?

Basic approaches in MSA.
20. What are the basic steps necessary for comparing multiple sequences?
21. Explain the difference between the progressive and iterative approaches to MSA.
22. In which cases do we need to compare entire sequences, and in which cases only their segments?
23. Which of the described MSA approaches are based on global alignment, and which on local alignment?
24.
What kind of alignment is more useful for finding conserved regions in protein sequences?

Algorithms and software for comparisons of multiple sequences.
25. What are the most popular algorithms you know?
26. Explain the basic idea of MSA scoring.
27. What algorithm would you use to find regions conserved between several homologous genes?

Searching of databases for similar sequences.
28. What is the most common type of database similarity search?
29. Why are strict pair-wise sequence comparison algorithms not the best choice when we want to compare query sequences with a large database?
30. Which type of pair-wise comparison is more useful for database searches?
31. What database search algorithms do you know?
32. Try to explain the principles by which they work.

Key information elements.

Biological background for sequence analysis.
1. There are four nucleotides (A, C, G, T) serving as building blocks for DNA molecules, and 20 amino acids serving as building blocks for proteins.
2. Genes are DNA segments encoding information for the synthesis of proteins.
3. The triplet code governs the transmission of information from genes to proteins: 3 nucleotides encode 1 amino acid.
4. The two major steps of information transmission are transcription (synthesis of mRNA from DNA) and translation (synthesis of proteins from mRNA). There are also other stages in this process (splicing, etc.).
5. DNA is a double-stranded molecule. For most purposes of sequence analysis, however, knowledge of one strand is sufficient, because the other one can be restored by the rules of complementarity (A is complementary to T, and C to G).
6. The shape (structure) of a protein molecule is primarily determined by its amino acid sequence.
7. A sequence is the order of the constituent subunits of a large biological molecule, for example, the order of amino acids in a protein or the order of nucleotides in DNA.
Two components compose the sequence: its subunits (building blocks) and their order.
8. DNA molecules are sequences of nucleotides. Protein molecules (polypeptides) are sequences of amino acids. The properties of these molecules are largely defined by their sequences, so to study and compare the molecules, we must study and compare their sequences.

Basic terms and approaches for sequence analysis.
9. Sequence comparison starts with the comparison of two sequences (pair-wise comparison).
10. Basic approaches to pair-wise comparison: 1) dot matrix, 2) exhaustive alignment including all possible combinations (practically infeasible), 3) dynamic programming pair-wise alignment (letter by letter), and 4) alignment by word methods.
11. Global and local alignments. In global alignment, the entire sequences are aligned, using as many characters as possible, up to both ends of each sequence. In local alignment, the sequence segments with the highest density of matches are aligned.
12. Global alignment is best suited for quite similar sequences. Local alignment is best suited for sequences with only local similarity.
13. The optimal alignment is the one with the best total score over the possible matches, mismatches and gaps (insertions and deletions). The score is typically the sum of amino acid (or nucleotide) pair scores minus penalties for gaps (their opening and length). Special matrices are used for the pair scoring.
14. Dynamic programming is the progressive building of an alignment by comparing two residues at a time, moving through all matching positions from one end of each sequence (segment) to the other and scoring each point; the alignment with the highest score is chosen.
15. For scoring an alignment, special matrices are used: Dayhoff Amino Acid Substitution Matrices (Percent Accepted Mutation, or PAM, matrices) list the likelihood of change from one amino acid to another in homologous protein sequences during evolution, for a certain period of evolutionary time.
Blocks Amino Acid Substitution Matrices (BLOSUM) by Henikoff and Henikoff are based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns, called blocks.
16. In word methods, sequences are broken down into short words, and combinations of the words are then compared to find similar regions. These methods are used mostly for database searches.
17. The most popular software for pair-wise sequence comparisons: 1) DotPlot and Compare (dot matrix approach); 2) GAP (global alignment, Needleman-Wunsch dynamic algorithm); 3) BestFit (local alignment, Smith-Waterman dynamic algorithm); 4) LALIGN (finding multiple unique nonintersecting local alignments); 5) FASTA and BLAST (word algorithms for database searches).
18. The following scheme is helpful for resolving problems of pair-wise sequence comparison:

Multiple sequence alignment (MSA)
19. DNA and protein sequences of different organisms are often related. Genes with similar function are conserved across widely divergent species.
20. Through simultaneous alignment of the sequences of these genes, sequence patterns that have been subject to conservation and alteration may be analyzed.
21. In MSA, sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment, just as for the alignment of two sequences.
22. Finding the optimal MSA of more than two sequences, including matches, mismatches, and gaps, poses a very difficult challenge. Algorithms used for the optimal alignment of pairs of sequences can be extended to three sequences, but for more than three sequences, only a small number of relatively short sequences may be analyzed.
23.
Thus, approximate heuristic methods are used, including (1) a progressive global alignment of the sequences, starting with an alignment of the most alike sequences and then building up the alignment by adding more sequences; (2) iterative methods that make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result; and (3) alignments based on locally conserved patterns found in the same order in the sequences.
24. MSA calculates the multiple alignment score by adding the scores of the corresponding pair-wise alignments in the MSA.
25. Optimizing the MSA is achieved by optimizing the score, that is, by maximizing the number of matched pairs (or minimizing the cost or number of mismatched pairs) summed over all columns in the MSA.
26. The MSA of a set of sequences may also be viewed as an evolutionary history of the sequences.
27. The MSA of a set of sequences can provide information about the most alike regions in the set. In DNA, conserved regions may have been inherited from the ancestor sequence. In proteins, such regions may represent conserved functional or structural domains.
28. CLUSTAL, one of the popular computer programs for progressive MSA, performs a global multiple sequence alignment in the following steps: (1) perform pair-wise alignments of all of the sequences; (2) use the alignment scores to produce a phylogenetic tree; and (3) align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
29. The following scheme is helpful for resolving problems of multiple sequence comparison:

Database similarity searches.
30. With the availability of several complete genomes (including the human genome) and other sequence data, there is ample information about the biological function of particular sequences in model organisms that may be exploited to predict the function of similar genes in other organisms.
31.
In database searches, the sequence of the gene or protein of interest is compared to every sequence in a sequence database, and the similar ones are identified. Alignments with the best-matching sequences are shown and scored. If a query sequence can be readily aligned to a database sequence of known function, structure, or biochemical activity, the query sequence is predicted to have the same function, structure or biochemical activity.
32. The reliability of the predictions depends on the quality of the alignment between the sequences. Roughly, if more than one-half of the amino acid sequence of the query and database proteins (or a corresponding number of nucleotides for DNA) is identical in the sequence alignment, the prediction is very strong.
33. There are different possible kinds of database searches: (1) a standard search of a sequence database with a query sequence; (2) a search for a characteristic sequence pattern in the database or query.
34. Searching a sequence database for sequences that are similar to a query sequence is the most common type of database similarity search. The search provides a list of database sequences with which the query sequence can be aligned, with statistical evaluation of the alignment scores.
35. The number of database searches required and the large size of the sequence databases made it necessary to develop algorithms for pair-wise sequence comparison that are faster than the strict algorithms for pair-wise alignment.
36. Heuristic (tried-and-true) methods of database searches (FASTA and BLAST) work at least 50 times faster than the strict algorithms.
37. These are word methods, which compare the sequences by finding short common patterns in the query and database sequences and joining the patterns into an alignment.
38. BLAST gained a further increase in speed by searching only for the more significant patterns in nucleic acid and protein sequences.
BLAST is very popular because the program is available on the World Wide Web through a large server at the National Center for Biotechnology Information (NCBI) (http://ncbi.nlm.nih.gov) and at many other sites.
39. With the more recent increases in computer speed and size and with algorithmic improvements, Smith-Waterman full sequence alignment may also be used for database searches. It is 50-fold or more slower than FASTA and BLAST, but is able to find more distantly related sequences.
40. The following scheme represents typical stages of sequence analysis involving database searches:

The 1/20/04 Home Assignment:
1. Find protein homologs of the protein query sequence (alpha-fetoprotein precursor from Homo sapiens) using the BLASTP program (http://www.ncbi.nlm.nih.gov/).
2. Find DNA homologs of the same query sequence using TBLASTN.
3. Perform a global pairwise sequence alignment of the protein query sequence from 1. with the top homolog found in 1. using GAP (SeqWeb program from http://gene.med.ohio-state.edu/gcg-bin/seqweb.cgi).
4. Find the segments of best similarity between the top DNA homologs found in 2. using the local optimal sequence alignment program BestFit of SeqWeb.
5. Make a multiple sequence alignment of the top protein homologs found in 1. using progressive pairwise alignments (program PileUp of SeqWeb).
6. Find a consensus sequence based on the results obtained in 5. using the PRETTY program of SeqWeb.
7. Using programs for pair-wise and multiple sequence alignment and database searches, perform a comparative analysis of the similarity of conserved segments in the promoters and coding regions of the genes most similar to a gene encoding the entire protein that contains the given fragment:
>query
MAKNTAIGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTERLIGDA
AKNQVALNPQNTVFDAKRLIGRKFGDAVVQSDMKHWPFQVVNDGDKPKVQVNYK
GESRSFFPEEISSMVLTKMKEIAEAYLGHPVT

About GCG/SeqWeb:
Input Sequences: since you all use the same account, the sequences you added stay together, which is confusing.
One way to solve this is, for each alignment operation, to put all input sequences to be aligned (in FASTA format, including the header line) into one txt file (with at least one empty line between sequences), and then use Add From Local File to load the sequences. The newly added sequences should then be highlighted in the box. Don't change or click anything in the box; click Run directly to run the alignment. Also pay attention to the sequence type (DNA or peptide) and choose the right service in the menu.

What to submit: email to ioschikhes-1@medctr.osu.edu
For each of the seven exercises:
1, 2: The top four sequences (preferably from different species) in the search result, in FASTA format, preferably in one txt file.
3-6: The alignment result, preferably in html format.
7: The alignment results, preferably in html format, and an analysis report.

When to submit: Let's say before my next class (1/29). But please check back for any updates and let me know if you have any problems.
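
The one-file input described under About GCG/SeqWeb can also be prepared with a short script instead of by hand. The following is only a minimal Python sketch and is not part of SeqWeb: the names guess_type, fasta_text and alignment_input.txt are made up for illustration, the DNA-versus-peptide check is a crude assumption (a peptide built only from the letters A, C, G, T and N would be misread as DNA), and the second record is a made-up placeholder cut from the exercise 7 fragment, not a real homolog.

```python
import tempfile
from pathlib import Path

DNA_LETTERS = set("ACGTN")  # assumption: N stands for an unknown base


def clean(seq):
    """Remove spaces and line breaks and convert to upper case."""
    return "".join(seq.split()).upper()


def guess_type(seq):
    """Crude guess: 'DNA' if every letter is A, C, G, T or N, else 'peptide'."""
    return "DNA" if set(clean(seq)) <= DNA_LETTERS else "peptide"


def fasta_text(records, width=60):
    """Format (header, sequence) pairs as FASTA with one empty line between records."""
    kinds = {guess_type(seq) for _, seq in records}
    if len(kinds) > 1:
        raise ValueError("mixed DNA and peptide records in one input file")
    blocks = []
    for header, seq in records:
        seq = clean(seq)
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        blocks.append(">" + header + "\n" + "\n".join(lines))
    return "\n\n".join(blocks) + "\n"


records = [
    # the protein fragment given in exercise 7
    ("query", "MAKNTAIGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTERLIGDA"
              "AKNQVALNPQNTVFDAKRLIGRKFGDAVVQSDMKHWPFQVVNDGDKPKVQVNYK"
              "GESRSFFPEEISSMVLTKMKEIAEAYLGHPVT"),
    # made-up placeholder record (first residues of the query), not a real homolog
    ("toy_homolog", "MAKNTAIGIDLGTTYSCVGVFQ"),
]

text = fasta_text(records)

# Write the file to a temporary folder; in practice any local path would do.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "alignment_input.txt"
    path.write_text(text)
    print(path.read_text())
```

The type check is there only as a reminder to pick the matching service (DNA or peptide) before running an alignment; the file produced this way would then be loaded with Add From Local File as described above.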