Learning objectives for Sequence Analysis 1

Learning Objectives for Sequence Analysis Lecture.
by Dr. Ilya Ioshikhes,
Department of Biomedical Informatics, 3017 Graves Hall,
Tel. 292-6514, E-mail: ioschikhes-1@medctr.osu.edu
Building blocks of DNA and RNA.
1. What are the building blocks of DNA?
2. What nucleotides are involved in DNA and RNA structure?
3. How is information transferred from DNA to proteins?
Building blocks of proteins.
4. What are the building blocks of proteins?
5. What is the primary factor determining a protein’s shape and structure?
DNA and proteins (polypeptides) as sequences.
6. What components are needed to build a sequence?
7. How are DNA and proteins considered in sequence analysis?
Basic approaches in sequence analysis.
8. What are the basic approaches to compare two sequences?
9. What are global and local alignment, and what is the difference?
10. In which cases do we need to compare entire sequences, and in which only
their segments?
11. Which kind of alignment is usually more useful for gene detection?
Algorithms and software for pair-wise sequence comparisons.
12. What are the most popular algorithms you know?
13. How are sequences with mismatches compared?
14. What algorithm would you use to find regions conserved between two proteins?
Multiple sequence alignment (MSA) and molecular evolution.
15. What is the difference between the MSA and pair-wise sequence alignment?
16. Explain the relationship between molecular evolution and MSA.
17. How are sequence mutations represented in MSA?
18. What are the basic types of sequence mutations?
19. How is pair-wise sequence alignment used in MSA?
Basic approaches in MSA.
20. What are the basic steps necessary for comparison of multiple sequences?
21. Explain the difference between progressive and iterative approaches to MSA.
22. In which cases do we need to compare entire sequences, and in which only
their segments?
23. Which of the described MSA approaches are based on global alignment, and
which on local alignment?
24. What kind of alignment is more useful for finding conserved regions in protein
sequences?
Algorithms and software for comparisons of multiple sequences.
25. What are the most popular algorithms you know?
26. Explain the basic idea of MSA scoring.
27. What algorithm would you use to find regions conserved between several
homologous genes?
Searching of databases for similar sequences.
28. What is the most common type of database similarity search?
29. Why are strict algorithms for pair-wise sequence comparison not the best choice
when we want to compare query sequences with a large database?
30. Which type of pair-wise comparison is more useful for database searches?
31. What database search algorithms do you know?
32. Try to explain the principles by which they work.
Key information elements.
Biological background for sequence analysis.
1. There are four nucleotides (A,C,G,T) serving as building blocks for DNA
molecules, and 20 amino acids serving as building blocks for proteins.
2. Genes are DNA segments encoding information for synthesis of proteins.
3. The triplet code governs transmission of information from genes to proteins: 3
nucleotides encode 1 amino acid.
4. The two major steps of information transmission are transcription (synthesis
of mRNA from DNA) and translation (synthesis of proteins from mRNA). There
are also other stages in this process (splicing, etc.).
5. DNA is a double-stranded molecule. For most purposes of sequence
analysis, however, knowledge of one strand is sufficient, because the other
can be restored by the rules of complementarity (A is complementary to T, and
C to G).
6. The shape (structure) of a protein molecule is primarily determined by its amino
acid sequence.
7. A sequence is the order of the constituent subunits of a large biological
molecule, for example, the order of amino acids in a protein or the order of
nucleotides in a nucleic acid. Two components compose the sequence: its
subunits (building blocks) and their order.
8. DNA molecules are sequences of nucleotides. Protein molecules (polypeptides)
are sequences of amino acids. Properties of these molecules are largely defined by
their sequences, so to study and compare the molecules, we must study and
compare their sequences.
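Items 3-5 above (triplet code, transcription and translation, complementarity)
can be illustrated with a minimal Python sketch. The function names are
hypothetical, and the codon table is deliberately only a small subset of the
standard genetic code, so unknown codons are reported as "X".

```python
# Illustrative sketch only. CODON_TABLE is a small subset of the standard
# genetic code (codons written on the DNA coding strand); "*" marks a stop.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K",
    "TGG": "W", "TAA": "*", "TAG": "*", "TGA": "*",
}


def reverse_complement(dna):
    """Restore the second strand (read 5'->3') from the first one."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Coding-strand DNA -> mRNA: the same letters with U in place of T."""
    return dna.replace("T", "U")


def translate(dna):
    """Read the coding strand three nucleotides at a time until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = codon not in subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(reverse_complement("ATGC"))
print(transcribe("ATGGCT"))
print(translate("ATGGCTAAATGGTAA"))
```

A real analysis would use the complete 64-codon table and would also have to
choose the reading frame and strand, which this sketch does not attempt.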
Basic terms and approaches for sequence analysis.
9. Sequence comparison starts from comparison of two sequences (pair-wise
comparison).
10. Basic approaches for the pair-wise comparison: 1) dot matrix, 2) exhaustive
alignment including all possible combinations (practically infeasible), 3) dynamic
pair-wise alignment (letter by letter) and 4) alignment by word methods.
11. Global and local alignments. In global alignment, the entire sequences are
aligned, using as many characters as possible, up to both ends of each sequence.
In local alignment, the sequence segments with the highest density of matches are
aligned.
12. Global alignment is best suited for quite similar sequences. Local alignment is
best suited for alignment of sequences with only local similarity.
13. Optimal alignment is one with the best total score of possible matches,
mismatches and gaps (insertions and deletions). The score is typically a sum of
amino acid (or nucleotide) pair scores minus penalties for gaps (their opening and
length). Special matrices are used for the pair scoring.
14. Dynamic programming builds an alignment progressively by comparing two
residues at a time, moving through all positions from one end of each
sequence (segment) to the other and scoring each point; the alignment with the
highest score is chosen.
15. For scoring of an alignment, special matrices are used. Dayhoff Amino Acid
Substitution Matrices (Percent Accepted Mutation or PAM Matrices) list
the likelihood of change from one amino acid to another in homologous protein
sequences during evolution, for a certain period of evolutionary time. Blocks
Amino Acid Substitution Matrices (BLOSUM) by Henikoff and Henikoff are
based on the observed amino acid substitutions in a large set of ~2000 conserved
amino acid patterns, called blocks.
16. In word methods, sequences are broken down into short words, and combinations
of the words are further compared to find similar regions. Used mostly for
database searches.
17. The most popular software for pair-wise sequence comparisons: 1) DotPlot and
Compare (dot matrix approach); 2) GAP (global alignment, Needleman-Wunsch
dynamic algorithm); 3) BestFit (local alignment, Smith-Waterman dynamic
algorithm); 4) LALIGN (finding multiple unique nonintersecting local
alignments); 5) FASTA and BLAST (word algorithms for database searches).
18. The following scheme is helpful for resolving problems of pair-wise sequence
comparison:
[Figure: scheme for pair-wise sequence comparison]
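The difference between global and local alignment (items 11-14) can be seen in
a small dynamic programming sketch in Python. It computes only the optimal
score, not the alignment itself. The scoring values (match +1, mismatch -1, a
flat gap penalty of -2 per position) are assumptions chosen for readability;
real programs use substitution matrices such as PAM or BLOSUM and separate
gap opening and extension penalties. The function name is hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), both sequences
                 are aligned end to end.
    local=True:  local alignment (Smith-Waterman style), the best-scoring
                 pair of segments is reported.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # In a global alignment, leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


print(align_score("GATTACA", "GATTACA"))
print(align_score("AAAACGTAAAA", "TTTTCGTTTTT", local=True))
```

With local=True the score never drops below zero, which is what lets a short
conserved segment stand out even when the flanking regions are unrelated.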
Multiple sequence alignment (MSA)
19. DNA and protein sequences of different organisms are often related. Genes with
similar function are conserved across widely divergent species.
20. Through simultaneous alignment of the sequences of these genes, sequence
patterns that have been subject to conservation and alteration may be analyzed.
21. In MSA, sequences are aligned optimally by bringing the greatest number of
similar characters into register in the same column of the alignment, just as for the
alignment of two sequences.
22. Finding an optimal MSA of more than two sequences, including matches,
mismatches, and gaps, is a very difficult problem. Algorithms used for
optimal alignment of pairs of sequences can be extended to three sequences, but
for more than three sequences, only a small number of relatively short sequences
may be analyzed.
23. Thus, approximate heuristic methods are used, including (1) a progressive global
alignment of the sequences starting with an alignment of the most alike sequences
and then building an alignment by adding more sequences, (2) iterative methods
that make an initial alignment of groups of sequences and then revise the
alignment to achieve a more reasonable result, (3) alignments based on locally
conserved patterns found in the same order of the sequences.
24. MSA calculates the multiple alignment score by adding the scores of the
corresponding pair-wise alignments in the MSA.
25. Optimizing the MSA is achieved by optimizing the score, that is by maximizing
the number of matched pairs (or minimizing the cost or number of mismatched
pairs) summed over all columns in the MSA.
26. The MSA of a set of sequences may also be viewed as an evolutionary history of
the sequences.
27. MSA of a set of sequences can provide information as to the most alike regions in
the set. In DNA, conserved regions may possibly be inherited from the ancestor
sequence. In proteins, such regions may represent conserved functional or
structural domains.
28. CLUSTAL, one of the popular computer programs for progressive MSA,
performs a global multiple sequence alignment by the following steps: (1) Perform
pair-wise alignments of all of the sequences; (2) use the alignment scores to
produce a phylogenetic tree; and (3) align the sequences sequentially, guided by
the phylogenetic relationships indicated by the tree.
29. The following scheme is helpful for resolving problems of multiple sequence
comparison:
[Figure: scheme for multiple sequence comparison]
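The scoring idea in items 24-25 (adding up pair-wise scores over all columns
of the MSA) can be sketched in Python as follows. The scoring values and the
decision to ignore gap-gap pairs are assumptions made for this illustration,
and the function name is hypothetical.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equal-length rows ('-' = gap)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")

    total = 0
    for column in zip(*alignment):              # walk the alignment by column
        for x, y in combinations(column, 2):    # every pair of rows
            if x == "-" and y == "-":
                continue                        # assumption: gap-gap costs nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs(["ACGT", "ACGT", "ACGT"]))
print(sum_of_pairs(["AC-T", "ACGT", "AGGT"]))
```

Optimizing the MSA then means searching for the arrangement of gaps that
gives the best value of such a score.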
Database similarity searches.
30. With the availability of several complete genomes (including the human genome)
and other sequence data, there is ample information on the biological function of
particular sequences in model organisms that may be exploited to predict the
function of similar genes in other organisms.
31. In the database searches, the sequence of the gene or protein of interest is
compared to every sequence in a sequence database, and the similar ones are
identified. Alignments with the best-matching sequences are shown and scored. If
a query sequence can be readily aligned to a database sequence of known
function, structure, or biochemical activity, the query sequence is predicted to
have the same function, structure or biochemical activity.
32. The reliability of the predictions depends on the quality of the alignment
between the sequences. Roughly, if more than one-half of the amino acids of the
query and database proteins (or a corresponding number of nucleotides for DNA)
are identical in the sequence alignment, the prediction is very strong.
33. There are different possible kinds of database searches: (1) a standard search of a
sequence database with a query sequence; (2) search for a characteristic sequence
pattern in database or query.
34. Searching a sequence database for sequences that are similar to a query sequence
is the most common type of database similarity search. The search provides a list
of database sequences with which the query sequence can be aligned, with
statistical evaluation of the alignment scores.
35. The number of database searches needed and the large size of the sequence
databases made it necessary to develop algorithms for pair-wise sequence
comparison that are faster than the strict algorithms for pair-wise
alignment.
36. Heuristic (tried-and-true) methods of database searches (FASTA and BLAST)
work at least 50 times faster than the strict algorithms.
37. These are word methods, comparing the sequences by short common patterns in
the query and database sequences and joining the patterns into an alignment.
38. BLAST gained a further increase in speed by searching only for more significant
patterns in nucleic acid and protein sequences. BLAST is very popular due to the
availability of the program on the World Wide Web through a large server at the
National Center for Biotechnology Information (NCBI) (http://ncbi.nlm.nih.gov)
and at many other sites.
39. With the more recent increased speed and size of computers and algorithmic
improvement, Smith-Waterman full sequence alignment also may be used for the
database searches. It is 50-fold or more slower than FASTA and BLAST, but is
able to find more distantly related sequences.
40. The following scheme represents typical stages of sequence analysis involving
database searches:
[Figure: typical stages of sequence analysis involving database searches]
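The word-method idea in items 16 and 37 can be illustrated with a short Python
sketch: break the query into short words, then rank database sequences by how
many of those words they share. This is only the first, seeding step in a much
simplified form; it is not the actual FASTA or BLAST algorithm, which go on to
join and extend the matches into scored alignments with statistical
evaluation. The function names, the word length k=3, and the toy database are
assumptions.

```python
def words(seq, k):
    """All distinct overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=3):
    """Rank database entries by the number of distinct words shared with query.

    database maps a sequence name to its sequence. Entries sharing no word
    with the query are dropped, which is where the speed-up comes from:
    most of the database is never aligned at all.
    """
    query_words = words(query, k)
    hits = []
    for name, seq in database.items():
        shared = len(query_words & words(seq, k))
        if shared:
            hits.append((name, shared))
    # Best candidates first; ties are broken by name for a stable order.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


toy_database = {
    "seqA": "XXMAKNTAIGXX",
    "seqB": "GGMAKNGG",
    "seqC": "WWWWWW",
}
print(rank_by_shared_words("MAKNTAIG", toy_database))
```

Only the top-ranked candidates would then be passed to a slower, stricter
alignment step.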
The 1/20/04 Home Assignment:
1. Find protein homologs to the protein query sequence (alpha-fetoprotein precursor
from Homo sapiens) using the BLASTP program (http://www.ncbi.nlm.nih.gov/).
2. Find DNA homologs to the same query sequence using TBLASTN.
3. Perform a global pairwise sequence alignment of the protein query sequence from
1. with the top homolog found in 1. using GAP (SeqWeb program from
http://gene.med.ohio-state.edu/gcg-bin/seqweb.cgi)
4. Find the segments of best similarity between top DNA homologs found in 2.
using local optimal sequence alignment program BestFit of SeqWeb.
5. Make multiple sequence alignment of the top protein homologs found in 1. using
progressive pairwise alignments (program PileUp of SeqWeb).
6. Find a consensus sequence based on results obtained in 5. using PRETTY
program of SeqWeb.
7. Using programs for pair-wise and multiple sequence alignment and database
searches, perform a comparative analysis of the similarity of conserved segments
in the promoters and coding regions of the genes most similar to the gene that
encodes the entire protein containing the given fragment:
>query
MAKNTAIGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTERLIGDA
AKNQVALNPQNTVFDAKRLIGRKFGDAVVQSDMKHWPFQVVNDGDKPKVQVNYK
GESRSFFPEEISSMVLTKMKEIAEAYLGHPVT
About GCG/SeqWeb:
Input Sequences: since you all use the same account, the sequences you add are
stored together, which is confusing. One way to solve this is, for each alignment
operation, to put all input sequences to be aligned (in FASTA format, including
the header line) into one txt file (with at least one empty line between
sequences), and then use Add From Local File to load the sequences. The newly
added sequences should then be highlighted in the box. Don't change or click
anything in the box; click Run directly to run the alignment. Also pay attention
to the sequence type (DNA or peptide) and choose the right service in the menu.
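If you prefer to prepare the input file with a script, the layout described
above (FASTA records with a header line, separated by an empty line) can be
produced with a few lines of Python. This is a minimal sketch: the function
name, the 60-character line width, and the example file name are assumptions,
not SeqWeb requirements.

```python
def format_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text.

    Each record starts with a '>' header line, the sequence is wrapped to
    `width` characters, and records are separated by one empty line.
    """
    blocks = []
    for header, seq in records:
        lines = [">" + header]
        lines += [seq[i:i + width] for i in range(0, len(seq), width)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


records = [
    ("query", "MAKNTAIGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTERLIGDA"),
    ("example_homolog", "MAKNTAIG"),  # placeholder sequence for illustration
]
text = format_fasta(records)
print(text)

# To save the result for Add From Local File (hypothetical file name):
# with open("input_sequences.txt", "w") as handle:
#     handle.write(text)
```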
What to submit: email to: ioschikhes-1@medctr.osu.edu
For each of the seven exercises:
1, 2: The top four sequences (preferably from different species) in the search
result, in FASTA format, preferably in one txt file.
3-6: The alignment result, preferably in html format.
7: The alignment results, preferably in html format, and an analysis report.
When to submit:
Let's put it before my next class (1/29). But please check back for any update and let
me know if you have any problems.