Computational Biology
Lecture (2): Biological Databases
By Dr. Safynaz AbdEl-Fattah, Computer Science

Human Genome Project
• In 1990, the Human Genome Project (HGP) began.
• The HGP was led by an international team of researchers who set out to sequence the human genome and map all of its genes.
• It was completed in April 2003.
• For the first time, the HGP gave us the ability to read nature's complete genetic blueprint for building a human being, by publishing the sequence of the entire genome's three billion base pairs.
From: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Bioinformatics consists of two subfields
▪ Development of computational tools and databases.
▪ Application of these tools and databases in generating biological knowledge.
These two subfields are complementary to each other. Tool development includes writing software for sequence, structural, and functional analysis, as well as constructing biological databases. The analysis of biological data often generates new problems and challenges that in turn encourage the development of new and better computational tools.

Computational Tools & Databases
Developers ('Bioinformaticians') develop and maintain the computational tools and databases:
o Computer scientists
o Software engineers
o Mathematicians
o Statisticians
o Computational biologists
Users apply the computational tools and databases:
• Biology researchers
• Biology scientists

The computational tools are used in three areas of genomic and molecular biological research:
• Molecular sequence analysis: sequence alignment, sequence database searching, motif and pattern discovery, gene and promoter finding, reconstruction of evolutionary relationships, and genome assembly and comparison.
• Molecular structural analysis: nucleic acid structure prediction, protein structure prediction, protein structure comparison, and protein structure classification.
• Molecular functional analysis: gene expression profiling, protein–protein interaction prediction, protein subcellular localization prediction, metabolic pathway reconstruction, and simulation.
The three areas of bioinformatics analysis are not isolated but often interact to produce integrated results. For example, protein structure prediction depends on sequence alignment data.

Limitations of bioinformatics
▪ Relying on computational predictions, like relying on intelligence reports, can be dangerous if their accuracy is limited. Poor-quality intelligence can yield costly mistakes, if not complete failures.
▪ Bioinformatics predictions are not formal proofs of any concepts. They do not replace the traditional experimental research methods of testing hypotheses.
▪ The quality of bioinformatics predictions depends on the quality of the data and the sophistication of the algorithms being used.
▪ Sequence data from high-throughput analysis often contain errors. If the sequences are wrong or the annotations are incorrect, the results of the downstream analysis are misleading as well.
▪ The outcome of computation also depends on the computing power available.
▪ Many accurate but exhaustive algorithms cannot be used because they take too long to compute. Instead, less accurate but faster algorithms have to be used. This is a necessary tradeoff between accuracy and computational feasibility.
▪ Therefore, it is important to keep in mind the potential for errors produced by bioinformatics programs. Caution should always be exercised when interpreting prediction results.
▪ It is good practice to use multiple programs, if they are available, and to perform multiple evaluations. A more accurate prediction can often be obtained by drawing a consensus from the results of different algorithms.

Databases
▪ One of the hallmarks of modern genomic research is the generation of huge amounts of raw sequence data. As the volume of genomic data grows, sophisticated computational methodologies are required to manage it.
▪ Thus, the first challenge in the genomics era is to store and handle this volume of information by establishing and using computer databases. Developing databases to handle the vast amount of molecular biological data is therefore a fundamental task of bioinformatics.
▪ Although data retrieval is the main purpose of all databases, biological databases often have a higher-level requirement, known as knowledge discovery: the identification of connections between pieces of information that were not known when the information was first entered.
▪ For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs. These features facilitate the discovery of new biological insights from raw data.

BIOLOGICAL DATABASES
Based on database structure:
1. Flat files
2. Relational databases
3. Object-oriented databases
Based on contents:
1. Primary databases
2. Secondary databases
3. Specialized databases

Types of Biological Databases Based on Structure
▪ Current biological databases use all three types of database structures: flat files, relational, and object oriented. Despite the obvious drawbacks of using flat files in database management, many biological databases still use this format. The justification is that this system involves a minimum amount of database design and that the search output can be easily understood by working biologists.
• Flat File Databases: a flat file is a long text file that contains many entries separated by a delimiter, a special character such as a vertical bar (|). Within each entry are a number of fields separated by tabs or commas. The text file does not contain any hidden instructions for computers to search for specific information or to create reports based on certain fields from each record. The text file can be considered a single table. Thus, to search for a piece of information, a computer has to read through the entire file, which is an inefficient process. As the database size increases or the data types become more complex, information retrieval from this style of database can become very difficult.
• To facilitate efficient access and retrieval of data, sophisticated computer software programs for organizing, searching, and accessing data have been developed. They are called database management systems. Depending on the type of data structure, these systems can be classified into relational database management systems and object-oriented database management systems. Consequently, databases employing these management systems are known as relational databases or object-oriented databases, respectively.
• Relational Databases: relational databases use a set of tables to organize data. Each table is made up of columns and rows. Columns represent individual fields. Rows represent values in the fields of records.
The columns in a table are indexed according to a common feature called an attribute, so they can be cross-referenced in other tables. Different tables can be linked by common data categories, which makes it easier to find specific information. To execute a query in a relational database, the system selects linked data items from different tables and combines the information into one report. Therefore, specific information can be found more quickly in a relational database than in a flat file database.
• Object-Oriented Databases: object-oriented databases store data as objects. In an object-oriented programming language, an object can be considered a unit that combines data with the mathematical routines that act on the data. The database is structured such that the objects are linked by a set of pointers defining predetermined relationships between the objects. Searching the database involves navigating through the objects with the aid of the pointers that link them. Programming languages such as C++ are used to create object-oriented databases.

Types of Biological Databases Based on Contents
▪ Based on their contents, biological databases can be roughly divided into three categories: primary databases, secondary databases, and specialized databases.
▪ Primary databases contain original biological data. They are archives of raw sequence or structural data submitted by the scientific community. GenBank and the Protein Data Bank (PDB) are examples of primary databases.
▪ Secondary databases contain computationally processed or manually curated information, based on original information from primary databases. Translated protein sequence databases containing functional annotation belong to this category. Examples are SWISS-PROT and the Protein Information Resource (PIR).
▪ Specialized databases are those that serve a particular research interest.
For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.

Primary Databases
▪ There are three major public sequence databases that store raw nucleic acid sequence data produced and submitted by researchers worldwide: GenBank, the European Molecular Biology Laboratory (EMBL) database, and the DNA Data Bank of Japan (DDBJ), all of which are freely available on the Internet. Most of the data in these databases are contributed directly by authors, with a minimal level of annotation.
▪ These public databases collaborate closely and exchange new data daily. Together they constitute the International Nucleotide Sequence Database Collaboration. By connecting to any one of the three databases, one should have access to the same nucleotide sequence data. Although the three databases contain the same sets of raw data, each uses a slightly different format to represent the data.
▪ There is only one centralized database, the Protein Data Bank (PDB), for the three-dimensional structures of biological macromolecules. This database archives atomic coordinates of macromolecules (both proteins and nucleic acids) determined by experimental methods such as x-ray crystallography and NMR spectroscopy. It uses a flat file format to represent protein name, authors, experimental details, secondary structure, and atomic coordinates. The web interface of the PDB also provides viewing tools for simple image manipulation.

Secondary databases
▪ Secondary databases contain computationally processed sequence information derived from the primary databases. The amount of computational processing varies greatly among secondary databases: some are simple archives of translated sequence data from identified open reading frames in DNA, whereas others provide additional annotation and higher-level information about structure and function.
▪ SWISS-PROT is an example of a secondary database. It provides detailed sequence annotation that includes structure, function, and protein family assignment. The sequence data are mainly derived from TrEMBL, a database of translated nucleic acid sequences stored in the EMBL database.
▪ The protein annotation includes function, domain structure, disease association, and similarity with other sequences. The annotation adds significant value to each original sequence record. Each data record also provides cross-referencing links to other online resources of interest. Other features, such as very low redundancy and a high level of integration with other primary and secondary databases, make SWISS-PROT very popular among biologists.

Specialized databases
▪ Specialized databases focus on a specific research community or a particular organism. Their content may be sequences or other types of information. The sequences in these databases may overlap with those in a primary database. Many taxonomy-specific genome databases fall within this category.
▪ Examples of specialized databases are FlyBase, WormBase, AceDB, and TAIR. In addition, there are specialized databases that contain original data derived from functional analysis. For example, the Microarray Gene Expression Database at the European Bioinformatics Institute (EBI) is one of the gene expression databases available.

Databases and Retrieval Systems | Brief Summary of Content | URL
Primary databases
GenBank | Primary nucleotide sequence database in NCBI | www.ncbi.nlm.nih.gov/Genbank
EMBL | Primary nucleotide sequence database in Europe | www.ebi.ac.uk/embl/index.html
DDBJ | Primary nucleotide sequence database in Japan | www.ddbj.nig.ac.jp
Secondary databases
SWISS-Prot | Curated protein sequence database | www.ebi.ac.uk/swissprot/access.html
PIR | Annotated protein sequences | https://proteininformationresource.org/pirwww/
Specialized databases
FlyBase | A database of the Drosophila genome | https://flybase.org/
TAIR | Arabidopsis information database | www.arabidopsis.org
AceDB | Genome database for Caenorhabditis elegans | https://dbdb.io/db/acedb
HIV databases | HIV sequence data and related immunologic information | www.hiv.lanl.gov/content/index
Ribosomal database project | Ribosomal RNA sequences and phylogenetic trees derived from the sequences | http://rdp.cme.msu.edu/html
Microarray gene expression database | DNA microarray data and analysis tools | www.ebi.ac.uk/microarray

Interconnection between Biological Databases
▪ As mentioned, primary databases are central repositories and distributors of raw sequence and structure information. They support nearly all other types of biological databases. Therefore, the secondary and specialized databases need to connect to the primary databases and keep uploading sequence information. In addition, a user often needs information from both primary and secondary databases. Instead of making users visit multiple databases, it is convenient for entries in one database to be cross-referenced and linked to related entries in other databases that contain additional information. All of this creates a demand for linking different databases.
▪ The main barrier to linking different biological databases is format incompatibility: current biological databases use all three types of database structures (flat files, relational, and object oriented).
The heterogeneous database structures limit communication between databases. One solution to networking the databases is to use a specification language, which allows database programs at different locations to communicate over a network through an "interface broker" without having to understand each other's database structure.

Pitfalls of Biological Databases
▪ There are many errors in sequence databases.
▪ There are also high levels of redundancy in the primary sequence databases.
▪ Annotations of genes can be false or incomplete. All these types of errors can be passed on to other databases, causing propagation of errors. Generally, errors are more common in sequences produced before the 1990s.
▪ The National Center for Biotechnology Information (NCBI) has created a nonredundant database, called RefSeq, in which identical sequences from the same organism and associated sequence fragments are merged into a single entry. Protein sequences derived from the same DNA sequences are explicitly linked as related entries. This carefully curated database can be considered a secondary database. The SWISS-PROT database also has minimal redundancy for protein sequences compared with most other databases.

Information Retrieval from Biological Databases
▪ The most popular retrieval systems for biological databases are Entrez and the Sequence Retrieval System (SRS), which provide access to multiple databases for retrieval of integrated search results.
▪ Most search engines of public biological databases use logical terms such as AND, OR, and NOT to indicate relationships between the keywords used in a search, and parentheses ( ) to define which part of the search is to be executed first.

Entrez
www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
▪ It is a biological database retrieval system developed and maintained by NCBI.
It allows text-based searches for a wide variety of data, such as
• Annotated genetic sequence information
• Structural information
• Citations and abstracts
• Full papers
• Taxonomic data
▪ NCBI databases accessible through Entrez are among the most integrated databases. The key feature of Entrez is its ability to integrate information, which comes from cross-referencing between NCBI databases.
▪ Users do not have to visit multiple databases. For example, on a nucleotide sequence page, one may find cross-referencing links to the translated protein sequence, to genome mapping data, to the related PubMed literature, and to protein structures if available.
▪ One of the databases accessible from Entrez is a biomedical literature database known as PubMed, which contains abstracts and, in some cases, the full text of articles from nearly 4,000 journals. PubMed retrieves information based on Medical Subject Headings (MeSH) terms.
▪ The MeSH system consists of a collection of more than 20,000 controlled and standardized vocabulary terms used for indexing articles. It is a dictionary that helps convert search keywords into standardized terms that describe a concept. It therefore allows "smart" searches in which synonyms are used, so that the user gets not only exact matches but also related matches.
▪ Another way to broaden the retrieval is to use the "Related Articles" option. PubMed uses a word weight algorithm to identify related articles with similar words in the titles, abstracts, and MeSH terms.
▪ For a complex search, a user can use Boolean operators or a combination of the Limits and Preview/Index features. Field tags can be used to improve the efficiency of obtaining search results. The tags are identifiers for each field and are placed in brackets. For example, [AU] limits the search to the author name, and [JID] to the journal name. PubMed uses a list of tags for literature searches.
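
The Boolean operators, parentheses, and field tags described above can also be assembled programmatically. The following is a minimal sketch that only builds a query string and a URL; it does not contact NCBI. The helper names `tagged` and `build_query` are hypothetical, the author name is made up, and the use of the NCBI E-utilities `esearch.fcgi` address is an assumption, since the lecture refers only to the Entrez web interface.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch". The lecture itself only
# mentions the Entrez web interface, so treat this URL as an assumption.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a field tag in brackets, e.g. tagged('Smith J', 'AU')."""
    return f"{term}[{tag}]"


def build_query(all_of=(), any_of=(), none_of=()):
    """Combine terms with AND, OR, and NOT.

    Parentheses group the OR alternatives so they are evaluated first.
    """
    parts = list(all_of)
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {term}"
    return query


# Hypothetical search: papers by one author on either of two topics,
# excluding records that mention "review".
query = build_query(
    all_of=[tagged("Smith J", "AU")],
    any_of=["sequence alignment", "motif finding"],
    none_of=["review"],
)
url = ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": query})

print(query)
print(url)
```

Keeping the query as plain text until the last step makes it easy to paste the same string into the PubMed search box and compare results.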
▪ Another unique database accessible from Entrez is Online Mendelian Inheritance in Man (OMIM), which is a non-sequence-based database of human disease genes and human genetic disorders. Each entry in OMIM contains summary information about a particular disease as well as the genes related to the disease.
▪ NCBI also maintains a taxonomy database that contains the names and taxonomic positions of over 100,000 organisms with at least one nucleotide or protein sequence represented in the GenBank database.
▪ GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism. The content includes genomic DNA, mRNA, and raw sequence data. There is also a GenPept database for protein sequences, the majority of which are conceptual translations from DNA sequences.
▪ There are two ways to search for sequences in GenBank. One is to use text-based keywords, similar to a PubMed search. The other is to use molecular sequences and search by sequence similarity using BLAST.

The sequence databases contain these elements:
• Accession Number: a constant data identifier that is used to identify a sequence. It is a string of letters and/or numbers.
• Definition: a brief, one-line, textual sequence description.
• Source and taxonomy information.
• Complete literature references.
• Comments and keywords.
• The important FEATURE table.
• A summary or checksum line.
• The sequence itself.

GenBank Sequence Format
GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading. The resulting flat files contain three sections: Header, Features, and Sequence entry. The Header section describes the origin of the sequence, identification of the organism, and unique identifiers associated with the record.
The top line of the Header section is the Locus, which contains a unique database identifier for the sequence's location in the database (not a chromosome locus), the sequence length, the molecule type (e.g., DNA or RNA), and then a three-letter code for the GenBank division. There are 17 divisions in total, for example, PLN for plant, fungal, and algal sequences, and BCT for bacterial sequences. The Header also gives the accession number for the sequence, which is a unique number assigned to a piece of DNA when it was first submitted to GenBank and is permanently associated with that sequence. This is the number that should be cited in publications. There is also a version number.
The "Features" section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence. The "Source" field provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number. The "gene" field gives information about the nucleotide coding sequence and its name.
The third section of the flat file is the sequence itself, starting with the label "ORIGIN." The format of the sequence display can be changed by choosing options from a pull-down menu at the upper left corner. For DNA entries, there is a BASE COUNT report that includes the numbers of A, G, C, and T in the sequence. This section, for both DNA and protein sequences, ends with two forward slashes (the "//" symbol).

Alternative Sequence Formats
FASTA is one of the simplest and most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs. It has a single definition line that begins with a right angle bracket (>) followed by a sequence name. The plain sequence, in standard one-letter symbols, starts on the second line. Each line of sequence data is limited to sixty to eighty characters in width. The drawback of this format is that much annotation information is lost.
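
Because the FASTA layout described above is so simple, it can be read and written with a few lines of code. The following is a minimal sketch, not a replacement for an established parser; the function names `parse_fasta` and `format_fasta` and the example records are hypothetical and chosen only for illustration.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def format_fasta(name, sequence, width=60):
    """Write one record with sequence lines wrapped to `width` characters."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Hypothetical two-record file: one DNA sequence split over two lines,
# one short protein sequence.
example = """>seq1 hypothetical example record
ATGGCGTACGTTAGC
TTGACC
>seq2
MKTAYIAKQR
"""

records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

Note that the parser keeps only the definition line and the residues, which mirrors the drawback mentioned above: any annotation that a GenBank record carried is gone once the entry is in FASTA form.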

EMBL-EBI
The Sequence Retrieval System (https://www.ebi.ac.uk/) is a retrieval system maintained by the EBI that is comparable to NCBI Entrez. It is not as integrated as Entrez, but it allows the user to query multiple databases simultaneously, which makes it another good example of database integration. It also offers direct access to certain sequence analysis applications, such as sequence similarity searching and Clustal sequence alignment.

Conversion of Sequence Formats
Readseq, maintained by the EBI, is one of the most popular computer programs for sequence format conversion. It recognizes sequences in almost any format and writes a new file in an alternative format.

Algorithms and Complexity

Algorithms
• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem. We will specify problems in terms of their inputs and outputs, and the algorithm will be the method of translating the inputs into the outputs. A well-formulated problem is unambiguous and precise.
• To understand how an algorithm works, we need some way of listing the steps that the algorithm takes, while being neither too vague nor too formal. We will use pseudocode. Pseudocode is a language computer scientists often use to describe algorithms: it ignores many of the details that are required in a programming language.
• A problem describes a class of computational tasks. A problem instance is one particular input from that class.

Correct versus Incorrect Algorithms
• We say that an algorithm is correct when it can translate every input instance into the correct output. An algorithm is incorrect when there is at least one input instance for which the algorithm does not produce the correct output. So, unless you can justify that an algorithm always returns correct results, you should consider it to be wrong.

Fast versus Slow Algorithms
• You could estimate the running time of an algorithm simply by taking the product of the number of operations and the time per operation. However, computing devices are constantly improving, leading to a decreasing time per operation, so your notion of the running time would soon be outdated. Rather than computing an algorithm's running time on every computer, we rely on the total number of operations that the algorithm performs to describe its running time, since this is an attribute of the algorithm and not of the computer you are using. Unfortunately, determining how many operations an algorithm will perform is not always easy. If we know how to compute the number of basic operations that an algorithm performs, then we have a basis for comparing it with a different algorithm that solves the same problem. Rather than tediously counting every multiplication and addition, we can perform this comparison by gaining a high-level understanding of how each algorithm's operation count grows as the size of the input increases.
• To illustrate this, suppose an algorithm "A" performs 11n³ operations on an input of size n, and an algorithm "B" solves the same problem in 99n² + 7 operations. Which algorithm, A or B, is faster?
• Although A may be faster than B for some small n (e.g., for n between 0 and 9), B will become faster for large n (e.g., for all n ≥ 10). At n = 9, A needs 8,019 operations and B needs 8,026; at n = 10, A needs 11,000 and B needs 9,907.
• Since n³ is, in some sense, a "faster-growing" function than n² with respect to n, the constants 11, 99, and 7 do not affect the competition between the two algorithms for large n. We refer to A as a cubic algorithm and to B as a quadratic algorithm, and we say that A is less efficient than B because it performs more operations to solve the same problem when n is large.

Big-O Notation
• Computer scientists use Big-O notation to describe concisely the running time of an algorithm.
If we say that the running time of an algorithm is quadratic, or O(n²), it means that the running time of the algorithm on an input of size n is bounded above by a quadratic function of n. That bound may be 99.7n², or 0.001n², or 5n² + 3.2n + 99993.
• The main factor that describes the growth rate of the running time is the term that grows the fastest with respect to n, for example n² when compared with terms like 3.2n or 99993. All functions with a leading term of n² have more or less the same rate of growth, so we lump them into one class, which we call O(n²).
• The difference in behavior between two quadratic functions in that class, say 99.7n² and 5n² + 3.2n + 99993, is negligible when compared with the difference in behavior between two functions in different classes, say 5n² + 3.2n and 1.2n³. Of course, 99.7n² and 5n² are different functions, and we would prefer an algorithm that takes 5n² operations to one that takes 99.7n². However, computer scientists typically ignore the leading constant and pay attention only to the fastest-growing term.

Algorithm Design Techniques
• We will illustrate the most common algorithm design techniques. Future examples can be categorized in terms of these design techniques.
• To illustrate the design techniques, we will consider a very simple problem. Suppose your cordless phone rings, but you have misplaced the handset somewhere in your home, and it is dark out, so you cannot rely solely on eyesight. How do you find it?

Exhaustive Search (brute force algorithm)
▪ An exhaustive search, or brute force algorithm, examines every possible alternative to find one particular solution. For example, if you used the brute force algorithm to find the ringing telephone, you would ignore the ringing of the phone and simply walk over every square inch of your home checking to see if the phone was present.
You probably would not be able to answer the phone before it stopped ringing, unless you were very lucky, but you would be guaranteed to eventually find the phone no matter where it was. These are the easiest algorithms to design and understand, and sometimes they work acceptably for certain practical problems in biology. In general, though, brute force algorithms are too slow to be practical for anything but the smallest instances, and we will try to avoid them or to finesse them into faster versions.

Branch-and-Bound Algorithms
▪ In certain cases, as we explore the various alternatives in a brute force algorithm, we discover that we can omit a large number of alternatives, a technique that is often called branch-and-bound, or pruning. Suppose you were exhaustively searching the first floor and heard the phone ringing above your head. You could immediately rule out the need to search the basement or the first floor. What might have taken three hours may now take only one, depending on the amount of space that you can rule out.

Greedy Algorithms
▪ Many algorithms are iterative procedures that choose among a number of alternatives at each iteration. For example, you can view the problem as a series of decisions you have to make. Some of these alternatives may lead to correct solutions while others may not. Greedy algorithms choose the "most attractive" alternative at each iteration. Applied in general, however, the greedy strategy may produce incorrect results.
▪ In the telephone example, the corresponding greedy algorithm would simply be to walk in the direction of the telephone's ringing until you found it. The problem here is that there may be a wall (or an expensive and fragile vase) between you and the phone, preventing you from finding it. Unfortunately, these sorts of difficulties frequently occur in most realistic problems. In many cases, a greedy approach will seem "obvious" and natural but will be subtly wrong.
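
The contrast between a greedy choice that looks natural and an exhaustive search that is guaranteed to be right can be made concrete on a small problem. The sketch below is only an illustration and is not taken from the lecture: it uses a toy "make change with the fewest coins" problem with a hypothetical coin set (1, 3, 4), chosen because the greedy answer is subtly wrong for it. Both functions assume positive integer coins and a positive amount.

```python
from itertools import combinations_with_replacement


def greedy_change(amount, coins):
    """Greedy: at each step take the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    return picked if amount == 0 else None


def brute_force_change(amount, coins):
    """Exhaustive search: try every multiset of coins, fewest coins first.

    The first multiset that sums to `amount` therefore uses the minimum
    number of coins, at the price of examining many alternatives.
    """
    ordered = sorted(coins, reverse=True)
    for count in range(1, amount // min(coins) + 1):
        for combo in combinations_with_replacement(ordered, count):
            if sum(combo) == amount:
                return list(combo)
    return None


coins = [1, 3, 4]                      # hypothetical coin set
greedy = greedy_change(6, coins)       # picks 4 first, then needs 1 + 1
best = brute_force_change(6, coins)    # finds 3 + 3

print("greedy:", greedy)
print("brute force:", best)
```

For this input the greedy method uses three coins where two suffice, which is the kind of quiet failure the slide warns about; the exhaustive version never fails but its cost grows quickly with the amount.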

Dynamic Programming
▪ Some algorithms break a problem into smaller subproblems and use the solutions of the subproblems to construct the solution of the larger one. During this process, the number of subproblems may become very large, and some algorithms solve the same subproblem repeatedly, needlessly increasing the running time. Dynamic programming organizes computations to avoid recomputing values that you already know, which can often save a great deal of time.

Divide-and-Conquer Algorithms
• One big problem may be hard to solve, but two problems that are half the size may be significantly easier. Divide-and-conquer algorithms split the problem into smaller subproblems, solve the subproblems independently, and combine the solutions of the subproblems into a solution of the original problem. The situation is usually more complicated than this: after splitting one problem into subproblems, a divide-and-conquer algorithm usually splits these subproblems into even smaller sub-subproblems, and so on, until it reaches a point at which it no longer needs to recurse. A critical step in many divide-and-conquer algorithms is the recombining of solutions to subproblems into a solution for a larger problem. Often, this merging step can consume a considerable amount of time.

Machine Learning
• Machine learning algorithms often base their strategies on the computational analysis of previously collected data. To solve the phone search problem with this approach, you could collect statistics over the course of a year about where you leave the phone, learning where the phone tends to end up most of the time. If the phone was left in the living room 80% of the time, in the bedroom 15% of the time, and in the kitchen 5% of the time, then a sensible time-saving strategy would be to start the search in the living room, continue to the bedroom, and finish in the kitchen.
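
The learned search order in the phone example can be written down directly. The following is a minimal sketch under stated assumptions: the observation log is hypothetical (100 past sightings matching the 80/15/5 split above), the function names are made up for illustration, and "cost" is simply the number of rooms visited before the phone is found.

```python
def search_order(observations):
    """Count past sightings and rank rooms from most to least frequent."""
    counts = {}
    for room in observations:
        counts[room] = counts.get(room, 0) + 1
    order = sorted(counts, key=counts.get, reverse=True)
    return order, counts


def expected_rooms_searched(order, counts):
    """Average number of rooms visited before finding the phone,
    if it turns up in each room as often as it did in the past."""
    total = sum(counts.values())
    weighted = sum((i + 1) * counts[room] for i, room in enumerate(order))
    return weighted / total


# Hypothetical log of 100 past sightings (80% / 15% / 5%).
log = ["living room"] * 80 + ["bedroom"] * 15 + ["kitchen"] * 5

order, counts = search_order(log)
learned = expected_rooms_searched(order, counts)
reverse = expected_rooms_searched(order[::-1], counts)

print(order)
print(learned, reverse)
```

With these numbers the learned order visits 1.25 rooms on average, against 2.75 for the reverse order, which quantifies why starting in the living room is the sensible strategy.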

Randomized Algorithms
• If you have a coin, then before even starting to search for the phone, you could toss it to decide where to begin: on the first floor if the coin comes up heads, or on the second floor if it comes up tails. Although tossing coins may be a fun way to search for the phone, it is certainly not the intuitive thing to do, nor is it at all clear whether it gives you any algorithmic advantage over a deterministic algorithm. Nevertheless, randomized algorithms help to solve practical problems, and some of them have a competitive advantage over deterministic algorithms.

Algorithms:
- Sequence Alignment
- DNA restriction mapping
- Motif Finding

Sequence Alignment
As new biological sequences are generated at exponential rates, sequence comparison is becoming increasingly important for drawing functional and evolutionary inferences about a new protein from proteins that already exist in the database. The most fundamental process in this type of comparison is sequence alignment. Pairwise sequence alignment is the process of aligning two sequences and is the basis of database similarity searching and multiple sequence alignment.
The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences. Regions that are aligned but not identical represent residue substitutions; regions where residues from one sequence correspond to nothing in the other represent insertions or deletions that have taken place in one of the sequences during evolution.
When a sequence alignment reveals significant similarity among a group of sequences, they can be considered as belonging to the same family. If one member of the family has a known structure and function, then that information can be transferred to those members that have not yet been experimentally characterized. Therefore, sequence alignment can be used as a basis for predicting the structure and function of uncharacterized sequences.
It is important to distinguish sequence homology from the related term sequence similarity, because the two terms are often confused. Sequence homology is a conclusion about a common ancestral relationship, drawn from a sequence similarity comparison when the two sequences share a high enough degree of similarity. Similarity, on the other hand, is a direct result of observation from the sequence alignment. Sequence similarity can be quantified using percentages; homology is a qualitative statement. For example, one may say that two sequences share 40% similarity. It is incorrect to say that the two sequences share 40% homology. They are either homologous or nonhomologous.
In global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from the beginning to the end of both sequences to find the best possible alignment across their entire length. This method is more applicable to aligning two closely related sequences of roughly the same length. For divergent sequences and sequences of variable lengths, this method may not be able to generate optimal results.
Local alignment does not assume that the two sequences are similar over their entire length. It finds only the local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequence. This approach can be used for aligning more divergent sequences, with the goal of searching for conserved patterns in DNA or protein sequences. The two sequences to be aligned can be of different lengths. This approach is more appropriate for aligning divergent biological sequences that contain only modules that are similar, which are referred to as domains or motifs.
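
The difference between global and local alignment shows up clearly when both are scored on the same pair of sequences. The lecture does not name specific algorithms at this point; the sketch below swaps in the standard dynamic programming methods (Needleman-Wunsch for global, Smith-Waterman for local) and computes scores only, not the alignments themselves. The scoring values (match +1, mismatch -1, gap -1) and the two example sequences are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style: best score aligning all of a with all of b."""
    # prev[j] holds the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style: best score over any pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere, so poorly
            # matching flanks never drag the score down.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Hypothetical sequences sharing one conserved module (ACGTAC) with
# unrelated flanking regions.
seq_a = "TTTTACGTAC"
seq_b = "GGACGTACGG"

print("global:", global_score(seq_a, seq_b))
print("local:", local_score(seq_a, seq_b))
```

Here the global score is 0 because the unrelated flanks must be paid for with mismatches and gaps, while the local score is 6, the length of the shared module. Each table cell is computed once from its neighbors, which is the dynamic programming idea of not recomputing values that are already known.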