A Post-Genomics BioInformatics Survey . . . a whirlwind tour. Steve Thompson Florida State University School of Computational Science and Information Technology (CSIT) Valdosta State University BIOLOGY, CHEMISTRY, and GEOSCIENCES SEMINAR SERIES October 31, 2002 Introductory Overview: What is bioinformatics , genomics, sequence analysis, computational molecular biology . . . The Reverse Biochemistry Analogy. Using sequence analysis tools, one can infer all sorts of functional, evolutionary, and, perhaps, structural insight into a gene, without the need to isolate and purify massive amounts of protein! The computer is an essential part of this entire process. Definitions: Biocomputing and computational biology are fairly synonymous and both describe the use of computers and computational techniques to analyze biological systems. Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases. Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules. Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes. Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms. The exponential growth of molecular sequence databases & cpu power. Year BasePairs 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 680338 2274029 3368765 5204420 9615371 15514776 23800000 34762585 49179285 71947426 101008486 157152442 217102462 384939485 651972984 1160300687 2008761784 3841163011 11101066288 14396883064 Sequences 606 2427 4175 5700 9978 14584 20579 28791 39533 55627 78608 143492 215273 555694 1021211 1765847 2837897 4864570 10106023 13602262 http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html Database Growth (cont.) The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of August 2002, 91 complete, finished genomes are publicly available for analysis, not counting all the virus and viroid genomes available. The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature. Some neat stuff from the papers: We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA — Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000! The protein coding region of the genome is only about 1% or so, much of the remainder “junk” is “jumping,” “selfish DNA” of which much may be involved in regulation and control. 100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.) What are these databases like? What are primary sequences? (Central Dogma: DNA —> RNA —> protein) Primary refers to one dimension — all of the “symbol” information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence, however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases. What are sequence databases? These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another. North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D. Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot. Asia: The DNA Data Bank of Japan (DDBJ). Content & Organization: Most sequence databases are examples of complex ASCII/Binary databases, but usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing index functions. Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure. Nucleic acid databases are split into subdivisions based on taxonomy (historical). Protein databases are often organized into sections by level of annotation. What are other biological databases? Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database. Still more; these can be considered ‘non-molecular’: Reference Databases: e.g. OMIM — Online Mendelian Inheritance in Man PubMed/MedLine — over 11 million citations from more than 4 thousand bio/medical scientific journals. Phylogenetic Tree Databases: e.g. the Tree of Life. Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes). Population studies data — which strains, where, etc. And then databases that most biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . . So how does one do Bioinformatics? Often on the InterNet over the World Wide Web: Site URL (Uniform Resource Locator) Content Nat’l Center Biotech' Info' http://www.ncbi.nlm.nih.gov/ databases/analysis/software PIR/NBRF http://www-nbrf.georgetown.edu/ protein sequence database IUBIO Biology Archive http://iubio.bio.indiana.edu/ database/software archive Univ. of Montreal http://megasun.bch.umontreal.ca/ database/software archive Japan's GenomeNet http://www.genome.ad.jp/ databases/analysis/software European Mol' Bio' Lab' http://www.embl-heidelberg.de/ databases/analysis/software European Bioinformatics http://www.ebi.ac.uk/ databases/analysis/software The Sanger Institute http://www.sanger.ac.uk/ databases/analysis/software Univ. of Geneva BioWeb http://www.expasy.ch/ databases/analysis/software ProteinDataBank http://www.rcsb.org/pdb/ 3D mol' structure database Molecules R Us http://molbio.info.nih.gov/cgi-bin/pdb/ 3D protein/nuc' visualization The Genome DataBase http://www.gdb.org/ The Human Genome Project Stanford Genomics http://genome-www.stanford.edu/ various genome projects Inst. for Genomic Res’rch http://www.tigr.org/ esp. microbial genome projects HIV Sequence Database http://hiv-web.lanl.gov/ HIV epidemeology seq' DB The Tree of Life http://tolweb.org/tree/phylogeny.html overview of all phylogeny Ribosomal Database Proj’ http://rdp.cme.msu.edu/html/ databases/analysis/software WIT Metabolism http://wit.mcs.anl.gov/WIT2/ metabolic reconstruction Harvard Bio' Laboratories http://golgi.harvard.edu/ nice bioinformatics links list NCBI’s BLAST & Entrez, EMBL’s SRS, + GCG’s SeqLab and LookUp, phylogenetics . . . So what are the alternatives . . . ? Desktop software solutions — public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. Omiga, MacVector, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters! Therefore, UNIX server-based solutions (e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.]): One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere! Operating system: UNIX command line operation hassles; communications software — telnet, ssh, xdmcp, etc. and terminal emulation; X graphics; file transfer — ftp, Mac Fetch, and scp/sftp; and editors — vi, emacs, pico (or desktop word processing followed by file transfer [save as "text only!"]). What about Homology? Inference through homology is a fundamental principle of all biology! What is homology — in this context it is similarity great enough such that common ancestry is implied. Walter Fitch, the famous molecular evolutionist, likes to relate the useful analogy “homology is like pregnancy, you either are or you’re not” — there’s no such thing as 65% pregnant! Pairwise Comparisons: The Dot Matrix Method. Provides a ‘Gestalt’ of all possible alignments between two sequences. Dynamic Programming. Heuristic Database Searching. Dot Matrix Analysis: RNA comparisons of the reverse, complement of a sequence to itself can often be very informative. The yeast phenylalanine tRNA sequence is compared to its reverse, complement using a 5 match within a of window of 7 stringency setting. The well known stem-loop, inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal. 22 GAGCGCCAGACT G || | ||||| | A 48 CTGGAGGTCTAG A Base position 22 through position 33 base pairs with (think — is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between. Pairwise Comparisons: Dynamic Programming. A ‘brute force’ approach just won’t work. The computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences, without considering gaps at all. If the two sequences are approximately the same length (N), this is a N2 problem. To include gaps, the calculation needs to be repeated 2N times to examine the possibility of gaps at each possible position within the sequences, now a N4N problem. Therefore, An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that: 1) you maximize the number of matching symbols between 1 and 2; 2) you minimize the number of indels within 1 and 2; and 3 )you minimize the number of mismatched symbols between 1 and 2. Therefore, the actual solution can be represented by: Sij = sij + max Si-1 j-1 or max Si-x j-1 + wx-1 or 2<x<i max Si-1 j-y + wy-1 2<y<I Where Sij is the score for the alignment ending at i in sequence 1 and j in sequence 2, sij is the score for aligning i with j, wx is the score for making a x long gap in sequence 1, wy is the score for making a y long gap in sequence 2, allowing gaps to be any length in either sequence. An oversimplified example: total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}]) Optimum Alignments: There will probably be more than one best path through the matrix and none of them may be the biologically CORRECT alignment. Starting at the top and working down as we did, then tracing back, I found two optimum alignments: cTATAtAagg | ||||| cg.TAtAaT. cTATAtAagg | |||| cgT.AtAaT. Each of these solutions yields a trace-back total score of 22. This is the number optimized by the algorithm, not any type of a similarity or identity score! Even though one of these alignments has 6 exact matches and the other has 5, they are both optimal according to the rather strange criteria by which we solved the algorithm. This would not have occurred had we used a realistic gap penalty. Software will report only one of these solutions. Do you have any ideas about how others could be discovered? Answer — Often if you reverse the solution of the entire dynamic programming process, other solutions can be found! This was a global solution. Negative numbers in the match matrix and picking the best diagonal within overall graph provides a local solution. Significance: When is an Alignment Worth Anything Biologically? Monte Carlo simulations: Z score = [ ( actual score ) - ( mean of randomized scores ) ] ( standard deviation of randomized score distribution ) Many Z scores measure the distance from a mean using a simplistic Monte Carlo model assuming a normal distribution, in spite of the fact that ‘sequence-space’ actually follows what is know as an ‘extreme value distribution;’ however, the Monte Carlo method does approximate significance estimates pretty well. Histogram Key: Each histogram symbol represents 604 search set sequences Each inset symbol represents 21 search set sequences z-scores computed from opt scores z-score obs exp (=) (*) < 20 650 0:== 22 0 0: 24 3 0:= 26 22 8:* 28 98 87:* 30 289 528:* 32 1714 2042:===* 34 5585 5539:=========* 36 12495 11375:==================*== 38 21957 18799:===============================*===== 40 28875 26223:===========================================*==== 42 34153 32054:=====================================================*=== 44 35427 35359:==========================================================* 46 36219 36014:===========================================================* 48 33699 34479:======================================================== * 50 30727 31462:=================================================== * 52 27288 27661:=============================================* 54 22538 23627:====================================== * 56 18055 19736:============================== * 58 14617 16203:========================= * 60 12595 13125:=====================* 62 10563 10522:=================* 64 8626 8368:=============*= 66 6426 6614:==========* 68 4770 5203:========* 70 4017 4077:======* 72 2920 3186:=====* 74 2448 2484:====* 76 1696 1933:===* 78 1178 1503:==* 80 935 1167:=* 82 722 893:=* 84 454 707:=* 86 438 547:* 88 322 423:* 90 257 328:* 92 175 253:* :========= * 94 210 196:* :=========* 96 102 152:* :===== * 98 63 117:* :=== * 100 58 91:* :=== * 102 40 70:* :== * 104 30 54:* :==* 106 17 42:* :=* 108 14 33:* :=* 110 14 25:* :=* 112 12 20:* :* 114 9 15:* :* 116 6 12:* :* 118 8 9:* :* >120 1030 7:*= :*======================================= ‘Sequence-space’ actually follows the ‘extreme value distribution.’ Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E value, can be calculated. The particulars of how BLAST and FastA do this differ, but the ‘takehome’ message is the same: The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database and the lower its Z score will be, i.e. is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters. These are the best hits, those most similar sequences with a Pearson zscore greater than 120 in this search. What about proteins — conservative replacements and similarity as opposed to identity, and similarity versus homology! Similarity is not automatically homology. Homology always means related by descent from a common ancestor. BLOSUM62 amino acid substitution matrix. Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919. A B C D E F G H I K L M N P Q R S T V W X Y Z A 4 -2 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -1 -2 -1 B -2 6 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -1 -3 2 C 0 -3 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4 D -2 6 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -1 -3 2 E -1 2 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -1 -2 5 F -2 -3 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 -1 3 -3 G 0 -1 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -1 -3 -2 H -2 -1 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 -1 2 0 I -1 -3 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 -1 -3 K -1 -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -1 -2 1 L -1 -4 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 -1 -3 M -1 -3 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 -1 -2 N -2 1 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -1 -2 0 P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -1 -3 -1 Q -1 0 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 -1 2 R -1 -2 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -1 -2 0 S 1 0 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -1 -2 0 T 0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -1 -2 -1 V 0 -3 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 -1 -2 W -3 -4 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1 2 -3 X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 Y -2 -3 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 -1 7 -2 Z -1 2 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -1 -2 x Values whose magnitude is 4 are drawn in outline characters to make them easier to recognize. Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity. Pairwise Comparisons: Database Searching. Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed to numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created. Hashing reduces the complexity of the search problem from N2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database searching the heuristic restricts search space by calculating a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. BLAST — Basic Local Alignment Search Tool, developed at NCBI. 1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized); 2) Prefilters repeat and “low complexity” sequence regions; 4) Can find more than one region of gapped similarity; 5) Very fast heuristic and parallel implementation; FastA — and its family of relatives, developed by Bill Pearson at the University of Virginia. 1) Works well for DNA against DNA searches (within limits of possible sensitivity); 2) Can find only one gapped region of similarity; 3) Relatively slow, should usually be run in the background; 4) Does not require specially prepared, preformatted databases. 6) Restricted to precompiled, specially formatted databases; Versions available of each for DNA-DNA, DNA-protein, protein-DNA, and proteinprotein searches. Translations done ‘on the fly’ for mixed searches. The algorithms: BLAST: Two word hits on the same diagonal above some similarity threshold triggers ungapped extension until the score isn’t improved enough above another threshold: the HSP. Initiate gapped extensions using dynamic programming for those HSP’s above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment. Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1. FastA: Combine nonoverlapping init regions on different diagonals: initn. Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score. What about multiple sequence alignment? Dynamic programming’s complexity increases exponentially with the number of sequences being compared: N-dimensional matrix . . . . complexity=[sequence length]number of sequences ‘Global’ heuristic solutions: See — MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher — http://searchlauncher.bcm.tmc.edu/ — but, severely limiting restrictions! Multiple Sequence Dynamic Programming: Therefore — pairwise, progressive dynamic programming restricts the solution to the neighbor-hood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment. Web resources for pairwise, progressive multiple alignment: http://www.techfak.unibielefeld.de/bcd/Curric/MulAli/welcome.html. http://pbil.univ-lyon1.fr/alignment.html http://www.ebi.ac.uk/clustalw/ http://searchlauncher.bcm.tmc.edu/ However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there! Reliability and the Comparative Approach: explicit homologous correspondence; manual adjustments based on knowledge, especially structural, regulatory, and functional sites. Therefore, editors like SeqLab and the Ribosomal Database Project: http://rdp.cme.msu.edu/html/. Structural & Functional correspondence in the Wisconsin Package’s SeqLab: Work with proteins! If at all possible: Twenty match symbols versus four, plus similarity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration. Complications: Beware of aligning apples and oranges [and grapefruit]! Parologous versus orthologous; genomic versus cDNA; mature versus precursor. Complications cont: Order dependence. Not that big of a deal. Substitution matrices and gap penalties. A very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’ PileUp -InSitu option). Format hassles! Specialized format conversion tools such as GCG’s From’ and To’ programs and PAUPSearch. Don Gilbert’s public domain ReadSeq program. Still more complications: Indels and missing data symbols (i.e. gaps) designation discrepancy headaches — ., -, ~, ?, N, or X . . . . . Help! The consensus and motifs: P-Loop Conserved regions can be visualized with a sliding window approach and appear as peaks. Let’s concentrate on the first peak seen here to simplify matters. A consensus isn’t necessarily the biologically “correct” combination. Therefore, build onedimensional ‘pattern descriptors.’ Motifs: PROSITE Database of protein ‘signatures’ — over 1,000 motifs. GHVDHGKS This motif, the P-loop, is defined: (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine. But motifs can not convey any degree of the ‘importance’ of the residues. a multiple sequence alignment, how can we use all of the information Enter Given contained in it to find ever more remotely similar sequences, that is those “Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those BLAST/Fast E values above ~10 or so? the Use a position specific, two-dimensional matrix where conserved areas of the Profile: alignment receive the most importance and variable regions hardly matter! -5 The threonine at position 27 is absolutely conserved — it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout the alignment — the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15. Profile Enhancements: PSI-BLAST uses profile methods to iterate and increase the sensitivity of database searches. Profiles can be statistically optimized with hidden Markov models. See Sean Eddy’s HMMer Package and the Pfam database. And profiles of motifs can even be discovered in unaligned and ‘unalignable’ sequences using Expectation Maximization. See Timothy Bailey’s MEME package. And what about Genomics? Easy — restriction digests and associated mapping; e.g. software like the Wisconsin Package’s Map, MapSort, and MapPlot. Harder — fragment assembly and genome mapping; such as packages from the University of Washington’s Genome Center (http://www.genome.washington.edu/), Phrep/Phrap/Consed (http://www.phrap.org/) and SegMap; and The Institute for Genomic Research’s (http://www.tigr.org/) Lucy and Assembler programs. Very hard — gene finding and sequence annotation. This is an incredibly difficult problem and is a primary focus of current genomics research. Easy— forward translation to peptides. Hard again — genome scale comparisons and analyses. Nucleic Acid Characterization: Recognizing Coding Sequences. Three general solutions to the gene finding problem: 1) all genes have certain regulatory signals positioned in or about them, 2) all genes by definition contain specific code patterns, 3) and many genes have already been sequenced and recognized in other organisms so we can infer function and location by homology if our new sequence is similar enough to an existing sequence. All of these principles can be used to help locate the position of genes in DNA and are often known as “searching by signal,” “searching by content,” and “homology inference” respectively. URFs and ORFs — definitions: URF: Unidentified Reading Frame — any potential string of amino acids encoded by a stretch of DNA. Any given stretch of DNA has potential URFs on any combination of six potential reading frames, three forward and three backward. ORF: Open Reading Frame — by definition any continuous reading frame that starts with a start codon and stops with a stop codon. Not usually relevant to discussions of genomic eukaryotic DNA, but very relevant when dealing with mRNA/cDNA or prokaryotic DNA. Signal Searching: locating transcription and translation affecter sites. One strategy — One-Dimensional Signal Recognition. Start Sites: Prokaryote promoter ‘Pribnow Box,’ TTGACwx{15,21}TAtAaT; Eukaryote transcription factor site database, TFSites.Dat; Shine-Dalgarno site, (AGG,GAG,GGA)x{6,9}ATG, in prokaryotes; Kozak eukaryote start consensus, cc(A,g)ccAUGg; AUG start codon in about 90% of genomes, exceptions in some prokaryotes and organelles. Signal Searching: locating transcription and translation affecter sites. One-Dimensional Approaches, cont. End Sites: ‘Nonsense’ chain terminating, stop codons, UAA, UAG, UGA; Eukaryote terminator consensus, YGTGTTYY; Eukaryote poly(A) adenylation signal, AAUAAA; but exceptions in some ciliated protists and due to eukaryote suppresser tRNAs. Signal Searching: locating transcription and translation affecter sites. Another Strategy — Two-Dimensional Weight Matrix. Exon/Intron Junctions. Donor Site Acceptor Site Exon Intron Exon A64G73G100T100A62A68G84T63 . . . 6Py74-87NC65A100G100N The splice cut sites occur before a 100% GT consensus at the donor site and after a 100% AG consensus at the acceptor site, but a simple consensus is not informative enough. Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices describe the probability at each base position to be either A, C, U, or G, in percentages. The Donor Matrix. CONSENSUS from: Donor Splice site sequences from Stephen Mount NAR 10(2) 459;472 figure 1 page 460 Exon %G %A %U %C 20 30 20 30 9 40 7 44 cutsite 11 64 13 11 74 9 12 6 100 0 0 0 Intron 0 0 100 0 29 61 7 2 12 67 11 9 84 9 5 2 9 16 63 12 18 39 22 20 CONSENSUS sequence to a certainty level of 75 percent. VMWKGTRRGWHH The cut site is four bases away from the absolute GU! 20 24 27 28 Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices, cont. The Acceptor Matrix. CONSENSUS of: Acceptor.Dat. IVS Acceptor Splice Site Sequences from Stephen Mount NAR 10(2); 459-472 figure 1 page 460 Intron cutsite Exon %G 15 22 10 10 10 6 7 9 7 5 5 24 1 0 100 52 24 19 %A 15 10 10 15 6 15 11 19 12 3 10 25 4 100 0 22 17 20 %T 52 44 50 54 60 49 48 45 45 57 58 30 31 0 0 8 37 29 %C 18 25 30 21 24 30 34 28 36 35 27 21 64 0 0 18 22 32 to a CONSENSUS position: sequence certainty level of 75.0 percent at each BBYHYYYHYYYDYAGVBH The cut site is fifteen bases away from the absolute AG! Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices, cont. The CCAAT site — occurs around 75 base pairs upstream of the start point of eukaryotic transcription, may be involved in the initial binding of RNA polymerase II. Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578. Preferred region: %G %A %U %C 7 32 30 31 25 18 27 30 motif within -212 to -57. 14 14 45 27 40 58 1 1 57 29 11 3 1 0 1 99 Optimized cut-off value: 0 0 0 100 1 0 99 0 12 68 15 5 9 10 82 0 87.2%. 34 13 2 51 30 66 1 3 CONSENSUS sequence to a certainty level of 68 percent at each position: HBYRRCCAATSR Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices, cont. The TATA site (aka “Hogness” box) — a conserved A-T rich sequence found about 25 base pairs upstream of the start point of eukaryotic transcription, may be involved in positioning RNA polymerase II for correct initiation and binds Transcription Factor IID. Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578. Preferred region: center between -36 and -20. Optimized cut-off value: %G 39 5 1 1 1 0 5 11 40 %A 16 4 90 1 91 69 93 57 40 %U 8 79 9 96 8 31 2 31 8 %C 37 12 0 3 0 0 1 1 11 39 33 33 33 36 14 21 21 21 17 12 8 13 16 19 35 38 33 30 28 79%. 36 20 18 26 CONSENSUS sequence to a certainty level of 61 percent at each position: STATAWAWRSSSSSS Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices, cont. The GC box — may relate to the binding of transcription factor Sp1. Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578. Preferred region: %G %A %U %C 18 37 30 15 41 35 12 11 motif within -164 to +1. 56 75 100 18 24 0 23 0 0 2 0 0 99 1 0 0 Optimized cut-off value: 88%. 0 82 81 62 70 13 19 40 20 17 0 29 8 0 7 15 18 1 18 9 15 27 42 37 62 0 1 0 6 61 31 9 CONSENSUS sequence to a certainty level of 67 percent at each position: WRKGGGHGGRGBYK Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices, cont. The cap signal — a structure at the 5’ end of eukaryotic mRNA introduced after transcription by linking the 5’ end of a guanine nucleotide to the terminal base of the mRNA and methylating at least the additional guanine; the structure is 7MeG5’ppp5’Np. Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578. Preferred region: %G %A %U %C 23 16 45 16 center between 1 and +5. Optimized cut-off value: 0 0 0 100 0 95 5 0 38 9 26 27 0 25 43 31 15 22 24 39 24 15 33 28 81.4%. 18 17 33 32 CONSENSUS sequence to a certainty level of 63 percent at each position: KCABHYBY Signal Searching: locating transcription and translation affecter sites. Two-Dimensional Weight Matrices, cont. The eukaryotic terminator weight matrix. Base freguencies according to McLauchlan et al. (1985) N.A.R. 13:1347-1368. Found in about 2/3's of all eukaryotic gene sequences. %G %A %U %C 19 13 51 17 81 9 9 1 9 3 89 0 94 3 3 0 14 4 79 3 10 0 61 29 11 11 56 21 19 13 47 21 CONSENSUS sequence to a certainty level of 68 percent at each position: BGTGTBYY Content Approaches: Strategies for finding coding regions based on the content of the DNA itself. Searching by content utilizes the fact that genes necessarily have many implicit biological constraints imposed on their genetic code. This induces certain periodicities and patterns to produce distinctly unique coding sequences; non-coding stretches do not exhibit this type of periodic compositional bias. These principles can help discriminate structural genes in two ways: 1) based on the local “non-randomness” of a stretch, and 2) based on the known codon usage of a particular life form. The first, the non-randomness test, does not tell us anything about the particular strand or reading frame; however, it does not require a previously built codon usage table. The second approach is based on the fact that different organisms use different frequencies of codons to code for particular amino acids. This does require a codon usage table built up from known translations; however, it also tells us the strand and reading frame for the gene products as opposed to the former. Content Approaches, cont. “Non-Randomness” Techniques; e.g. TestCode. Relies solely on the base compositional bias of every third position base. The plot is divided into three regions: top and bottom areas predict coding and noncoding regions, respectively, to a confidence level of 95%, the middle area claims no statistical significance. Diamonds and vertical bars above the graph denote potential stop and start codons respectively. Content Approaches, cont. Codon Usage Techniques; e.g CodonPreference. Genomes use synonymous codons unequally sorted phylogenetically. Each forward reading frame indicates a red codon preference curve and a blue third position GC bias curve. The horizontal lines within each plot are the average values of each attribute. Start codons are represented as vertical lines rising above each box and stop codons are shown as lines falling below the reading frame boxes. Rare codon choices are shown for each frame with hash marks below each reading frame. Homology Inference: Similarity searching can be particularly powerful for inferring gene location by homology. This can often be the most informative of any of the gene finding techniques, especially now that so many sequences have been collected and analyzed. The alignments from database searches can pinpoint the locations of genes, especially if between an unknown genomic query sequence and a known database cDNA sequence. But this too can be misleading and seldom gives exact start and stop and exon splice site positions. World Wide Web Servers for Gene Finding. Many servers have been established that can be a huge help with gene finding analyses. Most of these servers combine many of the methods discussed above but they consolidate the information and often combine signal and content methods with homology inference in order to ascertain exon locations. Many use powerful neural net or artificial intelligence approaches to assist in this difficult ‘decision’ process. A wonderful bibliography on computational methods for gene recognition has been compiled at Rockefeller University (http://linkage.rockefeller.edu/wli/gene/), and the Baylor College of Medicine’s Gene Search (http://searchlauncher.bcm.tmc.edu/seq-search/genesearch.html) also offers several gene finding tools. World Wide Web Gene Finders, cont. Five popular gene-finding services are GrailEXP, GeneId, GenScan, NetGene2, and GeneMark. The neural net system GrailEXP (Gene recognition and analysis internet link– EXPanded http://grail.lsd.ornl.gov/grailexp/) is a gene finder, an EST alignment utility, an exon prediction program, a promoter and polyA recognizer, a CpG island locater, and a repeat masker, all combined into one package. GeneId (http://www1.imim.es/software/geneid/index.html) is an ‘ab initio’ Artificial Intelligence system for predicting gene structure optimized in genomic Drosophila or Homo DNA. NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/), another ‘ab initio’ program, predicts splice site likelihood using neural net techniques in human, C. elegans, and A. thaliana DNA. GenScan (http://genes.mit.edu/GENSCAN.html) is perhaps the most ‘trusted’ server these days with vertebrate genomes. The GeneMark (http://opal.biology.gatech.edu/GeneMark/) family of gene prediction programs is based on Hidden Markov Chain modeling techniques; originally developed in a prokaryotic context the programs have now been expanded to include eukaryotic modeling as well. The combinatorial approach. Get all your data in one place. GCG’s SeqLab is a great way to do this due to its advanced annotation capabilities: Beyond finding genes: Genome scale analyses. Unfortunately much ’traditional’ sequence analysis software can’t do it, but there are some very good Web resources available for these types of ‘global view’ analyses. Let’s run through a few examples. NCBI’s Genome pages (http://www.ncbi.nlm.nih.gov/) present a good starting point in North America: Beyond finding genes: Genome scale analyses, cont. That can lead to neat places like the Genome Browser at the University of California, Santa Cruz (http://genome.ucsc.edu/) and the Ensembl project at the Sanger Center for BioInformatics (http://www.ensembl.org/): Beyond finding genes: Genome scale analyses, cont. And sites like the the University of Wisconsin’s E. coli Genome Project (http://www.genome.wisc.edu/) and The Institute for Genomic Research’s (http://www.tigr.org/) MUMMER package. Structural Inference: Secondary structure can be reliably predicted in many cases. See http://www.emblheidelberg.de/predictprotein /predictprotein.html, which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate representations if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. See SwissModel at http://www.expasy.ch/swissmod/ SWISS-MODEL.html. Phylogenetic Inference: Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See e.g. the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide. Conclusions: Gunnar von Heijne in his dated but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” He continues: “. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.” FOR MORE INFO... See the listed references and WWW sites. Contact FSU’s CSIT (http://www.csit.fsu.edu/) for general questions and me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration. References and a Comment: You have been exposed to a perplexing variety of bioinformatics techniques today. As in most of the biological sciences, the better you understand the chemical, physical, and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate to others in certain circumstances. Making these types of subjective, discriminatory decisions and utilizing all of the available options so that you can generate the most practical data for evaluation are two of the most important ‘take-home’ messages that I can offer! Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, 403-410. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402. Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018. Bucher, P. (1990). Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578. Bucher, P. (1995). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42, Postfach 10.2209, D-6900 Heidelberg, Germany. Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A. Genetics Computer Group (GCG) (Copyright 1982-2002) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, Inc. A Pharmocopeia Company, San Diego, California, U.S.A. Ghosh, D. (1990). A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756. Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, New York, U.S.A. Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358. References (cont.): Hawley, D.K. and McClure, W.R. (1983). Compilation and Analysis of Escherichia coli promoter sequences. Nucleic Acids Research 11, 2237-2255. Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919. Kozak, M. (1984). Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872. McLauchen, J., Gaffrey, D., Whitton, J. and Clements, J. (1985). The Consensus Sequences YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3’ Termini. Nucleic Acid Research 13 , 1347-1368. Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453. Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473. Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448. Proudfoot, N.J. and Brownlee, G.G. (1976). 3’ Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211214. Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599. Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS, 10, 671-675. Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A. Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482489. References (cont.): Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982). Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996. Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484. Swofford, D.L., PAUP* (Phylogenetic Analysis Using Parsimony, and Other Methods) (2002) Version 4, distributed by Sinauer Associates, Sunderland, Massachusetts, U.S.A. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680. von Heijne, G. (1987a) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A. von Heijne, G. (1987b). SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42. Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730. Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.