Computational Biology
Lecture (2)
Biological Databases
By Dr. Safynaz AbdEl-Fattah
Computer Science
Human Genome Project
• In 1990, the Human Genome Project (HGP) began.
• The HGP was led by an international team of researchers aiming to sequence the human genome and map all of its genes.
• The project was completed in April 2003.
• For the first time, the HGP gave us the ability to read nature's complete genetic blueprint for building a human being, by publishing the sequence of the entire genome's three billion base pairs.
From http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Bioinformatics consists of two subfields:
▪ Development of computational tools & databases.
▪ Application of these tools & databases in generating biological knowledge.
These two subfields are complementary to each other. Tool development includes writing software for sequence, structural, and functional analysis, as well as the construction of biological databases. The analyses of biological data often generate new problems and challenges that in turn encourage the development of new and better computational tools.
Computational Tools & Databases
Developers ('Bioinformaticians'):
o Computer scientists
o Software engineers
o Mathematicians
o Statisticians
o Computational biologists
Developers handle the development and maintenance of the computational tools and databases; users apply them in their research.
Users:
• Biology researchers
• Biology scientists
The computational tools are used in three areas of genomic and molecular biological research:
• Molecular sequence analysis
 sequence alignment
 sequence database searching
 motif and pattern discovery
 gene and promoter finding
 reconstruction of evolutionary relationships
 genome assembly and comparison
• Molecular structural analysis
 nucleic acid structure prediction
 protein structure prediction
 protein structure comparison
 protein structure classification
• Molecular functional analysis
 gene expression profiling
 protein–protein interaction prediction
 protein subcellular localization prediction
 metabolic pathway reconstruction and simulation
The three areas of bioinformatics analysis are not isolated but often interact to produce integrated results. For example, protein structure prediction depends on sequence alignment data.
Limitations of bioinformatics
▪ Relying on intelligence can be dangerous if the intelligence is of limited accuracy. Therefore, poor-quality intelligence can yield costly mistakes, if not complete failures.
▪ Bioinformatics predictions are not formal proofs of any concepts. They do not replace the traditional experimental research methods of testing hypotheses.
▪ The quality of bioinformatics predictions depends on the quality of the data and the sophistication of the algorithms being used.
▪ Sequence data from high-throughput analysis often contain errors. If the sequences are wrong or annotations are incorrect, the results from the downstream analysis are misleading as well.
Limitations of bioinformatics
▪ The outcome of computation also depends on the computing power available.
▪ Many accurate but exhaustive algorithms cannot be used because of the slow rate of computation.
▪ Instead, less accurate but faster algorithms have to be used. This is a necessary tradeoff between accuracy and computational feasibility.
▪ Therefore, it is important to keep in mind the potential for errors produced by bioinformatics programs.
▪ Caution should always be exercised when interpreting prediction results.
▪ It is a good practice to use multiple programs, if they are available, and perform multiple evaluations.
▪ A more accurate prediction can often be obtained if one draws a consensus by comparing results from different algorithms.
Databases
▪ One of the hallmarks of modern genomic research is the generation of huge amounts of raw sequence data. As the volume of genomic data grows, sophisticated computational methodologies are required to manage this huge amount of data.
▪ Thus, the first challenge of the genomics era is to store and handle the huge volume of information through the establishment and use of computer databases.
▪ The development of databases to handle the vast amount of molecular biological data is thus a fundamental task of bioinformatics.
▪ Although data retrieval is the main purpose of all databases, biological databases often have a higher level of requirement, known as knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered.
▪ For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs. These features facilitate the discovery of new biological insights from raw data.
BIOLOGICAL DATABASES
 Based on database structure:
1. Flat file databases
2. Relational databases
3. Object-oriented databases
 Based on content:
1. Primary databases
2. Secondary databases
3. Specialized databases
Types of Biological Databases Based on Structure:
▪ Current biological databases use all three types of database structures: flat files, relational, and object-oriented. Despite the obvious drawbacks of using flat files in database management, many biological databases still use this format. The justification is that this approach involves a minimal amount of database design, and the search output can be easily understood by working biologists.
• Flat file databases: the flat file format is a long text file that contains many entries separated by a delimiter, a special character such as a vertical bar (|). Within each entry are a number of fields separated by tabs or commas. The text file contains no hidden instructions for computers to search for specific information or to create reports based on certain fields of each record. The entire text file can be considered a single table. Thus, to search for a piece of information, a computer has to read through the entire file, which is an inefficient process. As database size increases or data types become more complex, this database style can make information retrieval very difficult.
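To make the linear-scan point concrete, here is a minimal Python sketch; the two-record file, its field layout, and the accession-like identifiers are all invented for illustration and do not follow any particular database's format:

```python
# Minimal sketch of searching a hypothetical flat-file database.
# Records are separated by "//" and fields by "|"; both conventions
# are invented here, as are the identifiers and descriptions.

FLAT_FILE = """\
AB000001|Homo sapiens|hypothetical mRNA, partial cds
//
AB000002|Mus musculus|hypothetical gene for beta-globin
//
"""

def search_flat_file(text, keyword):
    """Return every entry whose fields contain the keyword.

    The whole file has to be scanned entry by entry; there is no index,
    which is why flat files scale poorly as databases grow.
    """
    hits = []
    for entry in text.split("//"):
        fields = [f.strip() for f in entry.strip().split("|") if f.strip()]
        if any(keyword.lower() in field.lower() for field in fields):
            hits.append(fields)
    return hits

print(search_flat_file(FLAT_FILE, "globin"))
# [['AB000002', 'Mus musculus', 'hypothetical gene for beta-globin']]
```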
• To facilitate efficient access and retrieval of data, sophisticated computer software programs for organizing, searching, and accessing data have been developed. They are called database management systems. Depending on the type of data structure used, these database management systems can be classified into relational database management systems and object-oriented database management systems. Consequently, databases employing these management systems are known as relational databases or object-oriented databases, respectively.
• Relational databases: relational databases use a set of tables to organize data. Each table is made up of columns and rows. Columns represent individual fields; rows represent the values in the fields of records. The columns in a table are indexed according to a common feature called an attribute, so they can be cross-referenced in other tables. Different tables can be linked by common data categories, which facilitates finding specific information. To execute a query in a relational database, the system selects linked data items from different tables and combines the information into one report. Therefore, specific information can be found more quickly in a relational database than in a flat file database.
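As a minimal sketch of these ideas, using nothing beyond Python's standard library, the example below builds two toy tables that share a gene_id attribute and answers a query with a join; the table names, column names, lengths, and accession values are invented for illustration:

```python
import sqlite3

# Toy schema: a "genes" table and a "sequences" table share the gene_id
# attribute, so a single query can combine information from both tables.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE genes (gene_id TEXT, organism TEXT, product TEXT)")
cur.execute("CREATE TABLE sequences (gene_id TEXT, length INTEGER, accession TEXT)")
cur.executemany("INSERT INTO genes VALUES (?, ?, ?)", [
    ("g1", "Homo sapiens", "beta-globin"),
    ("g2", "Mus musculus", "insulin"),
])
cur.executemany("INSERT INTO sequences VALUES (?, ?, ?)", [
    ("g1", 1606, "X00001"),   # lengths and accessions are made up
    ("g2", 1432, "X00002"),
])

# The join links rows from both tables through the common gene_id column
# and returns one combined report instead of two separate lookups.
cur.execute("""
    SELECT g.organism, g.product, s.accession, s.length
    FROM genes AS g JOIN sequences AS s ON g.gene_id = s.gene_id
    WHERE g.product = ?
""", ("beta-globin",))
print(cur.fetchall())   # [('Homo sapiens', 'beta-globin', 'X00001', 1606)]
```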
• Object-oriented databases: object-oriented databases store data as objects. In an object-oriented programming language, an object can be considered a unit that combines data and the routines that act on the data. The database is structured such that the objects are linked by a set of pointers defining predetermined relationships between the objects. Searching the database involves navigating through the objects with the aid of the pointers linking them. Programming languages such as C++ are used to create object-oriented databases.
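Object-oriented database systems are full products in their own right, but the core idea of objects linked by pointers can be sketched in a few lines of Python; all class names, attributes, and the toy sequence below are illustrative and not part of any real database:

```python
# Illustrative objects linked by references, mimicking how an
# object-oriented database navigates predetermined relationships.

class Organism:
    def __init__(self, name):
        self.name = name
        self.genes = []           # references ("pointers") to Gene objects

class Gene:
    def __init__(self, symbol, organism):
        self.symbol = symbol
        self.organism = organism  # back-reference to the Organism object
        organism.genes.append(self)

    def gc_content(self, sequence):
        """Behavior bundled together with the data it acts on."""
        return (sequence.count("G") + sequence.count("C")) / len(sequence)

human = Organism("Homo sapiens")
hbb = Gene("HBB", human)

# Query by following pointers rather than by joining tables:
print([g.symbol for g in human.genes])           # ['HBB']
print(hbb.organism.name)                         # 'Homo sapiens'
print(round(hbb.gc_content("ATGGTGCACCTG"), 2))  # toy sequence, 0.58
```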
Types of Biological Databases Based on Contents
▪ Biological databases can be roughly divided into three categories based on their contents: primary databases, secondary databases, and specialized databases.
▪ Primary databases contain original biological data. They are archives of raw sequence or structural data submitted by the scientific community. GenBank and the Protein Data Bank (PDB) are examples of primary databases.
▪ Secondary databases contain computationally processed or manually curated information, based on original information from primary databases. Translated protein sequence databases containing functional annotation belong to this category. Examples are SWISS-Prot and the Protein Information Resource (PIR).
▪ Specialized databases are those that serve a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.
Primary Databases
▪ There are three major public sequence databases that store raw nucleic acid sequence data produced and submitted by researchers worldwide: GenBank, the European Molecular Biology Laboratory (EMBL) database, and the DNA Data Bank of Japan (DDBJ), all of which are freely available on the Internet. Most of the data in these databases are contributed directly by authors with a minimal level of annotation.
▪ These public databases closely collaborate and exchange new data daily. Together they constitute the International Nucleotide Sequence Database Collaboration. By connecting to any one of the three databases, one has access to the same nucleotide sequence data. Although the three databases contain the same sets of raw data, each uses a slightly different format to represent the data.
▪ There is only one centralized database, the Protein Data Bank (PDB), for the three-dimensional structures of biological macromolecules. This database archives atomic coordinates of macromolecules (both proteins and nucleic acids) determined by x-ray crystallography. It uses a flat file format to represent the protein name, authors, experimental details, secondary structure, and atomic coordinates. The web interface of PDB also provides viewing tools for simple image manipulation.
Secondary databases
▪ Secondary databases contain computationally processed sequence information derived from the primary databases. The amount of computational processing work varies greatly among the secondary databases; some are simple archives of translated sequence data from identified open reading frames in DNA, whereas others provide additional annotation and information related to higher levels of information regarding structure and function.
▪ SWISS-PROT is an example of a secondary database; it provides detailed sequence annotation that includes structure, function, and protein family assignment. The sequence data are mainly derived from TrEMBL, a database of translated nucleic acid sequences stored in the EMBL database.
▪ The protein annotation includes function, domain structure, disease association, and similarity with other sequences. The annotation provides significant added value to each original sequence record. The data record also provides cross-referencing links to other online resources of interest. Other features, such as very low redundancy and a high level of integration with other primary and secondary databases, make SWISS-PROT very popular among biologists.
Specialized databases
▪ Specialized databases focus on a specific research community or a particular organism. The content of these databases may be sequences or other types of information. The sequences in these databases may overlap with a primary database. Many genome databases that are taxonomically specific fall within this category.
▪ Examples of specialized databases are FlyBase, WormBase, AceDB, and TAIR. In addition, there are also specialized databases that contain original data derived from functional analysis. For example, the Microarray Gene Expression Database at the European Bioinformatics Institute (EBI) is one of the gene expression databases available.
Databases and Retrieval Systems

Primary databases:
• GenBank: primary nucleotide sequence database at NCBI (www.ncbi.nlm.nih.gov/Genbank)
• EMBL: primary nucleotide sequence database in Europe (www.ebi.ac.uk/embl/index.html)
• DDBJ: primary nucleotide sequence database in Japan (www.ddbj.nig.ac.jp)

Secondary databases:
• SWISS-Prot: curated protein sequence database (www.ebi.ac.uk/swissprot/access.html)
• PIR: annotated protein sequences (https://proteininformationresource.org/pirwww/)

Specialized databases:
• FlyBase: a database of the Drosophila genome (https://flybase.org/)
• TAIR: Arabidopsis information database (www.arabidopsis.org)
• AceDB: genome database for Caenorhabditis elegans (https://dbdb.io/db/acedb)
• HIV databases: HIV sequence data and related immunologic information (www.hiv.lanl.gov/content/index)
• Ribosomal Database Project: ribosomal RNA sequences and phylogenetic trees derived from the sequences (http://rdp.cme.msu.edu/html)
• Microarray gene expression database: DNA microarray data and analysis tools (www.ebi.ac.uk/microarray)
Interconnection between Biological Databases
▪ As mentioned, primary databases are central repositories and distributors of raw sequence and structure information. They support nearly all other types of biological databases. Therefore, the secondary and specialized databases need to connect to the primary databases and keep their sequence information updated. In addition, a user often needs to get information from both primary and secondary databases. Instead of making users visit multiple databases, it is convenient for entries in a database to be cross-referenced and linked to related entries in other databases that contain additional information. All of this creates a demand for linking different databases.
▪ The main barrier to linking different biological databases is format incompatibility: current biological databases use all three types of database structures (flat files, relational, and object-oriented). The heterogeneous database structures limit communication between databases. One solution to networking the databases is to use a specification language, which allows database programs at different locations to communicate in a network through an "interface broker" without having to understand each other's database structure.
Pitfalls of Biological Databases
▪ There are many errors in sequence databases.
▪ There are also high levels of redundancy in the primary sequence databases.
▪ Annotations of genes can be false or incomplete.
All these types of errors can be passed on to other databases, causing propagation of errors. Generally, errors are more common for sequences produced before the 1990s.
 The National Center for Biotechnology Information (NCBI) has created a nonredundant database, called RefSeq, in which identical sequences from the same organism and associated sequence fragments are merged into a single entry. Protein sequences derived from the same DNA sequences are explicitly linked as related entries. This carefully curated database can be considered a secondary database.
 The SWISS-PROT database also has minimal redundancy for protein sequences compared to most other databases.
Information Retrieval from Biological Databases
▪ The most popular retrieval systems for biological databases are Entrez and the Sequence Retrieval System (SRS), which provide access to multiple databases and return integrated search results.
▪ Most search engines of public biological databases use logical terms such as AND, OR, and NOT to indicate relationships between the keywords used in a search, and parentheses ( ) to define which part of the search is to be executed first.
Entrez
www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
▪ Entrez is a biological database retrieval system developed and maintained by NCBI. It allows text-based searches for a wide variety of data, such as
• Annotated genetic sequence information
• Structural information
• Citations and abstracts
• Full papers
• Taxonomic data
▪ NCBI databases accessible through Entrez are among the most integrated databases. The key feature of Entrez is its ability to integrate information, which comes from cross-referencing between NCBI databases.
▪ Users do not have to visit multiple databases. For example, on a nucleotide sequence page, one may find cross-referencing links to the translated protein sequence, genome mapping data, the related PubMed literature, and protein structures, if available.
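As a hedged sketch of programmatic access to this integration, the example below uses Biopython's Bio.Entrez module (assuming Biopython is installed and NCBI's E-utilities are reachable; the e-mail address and the query term are placeholders) to run a text search and then follow cross-reference links from a nucleotide record to related protein records:

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"   # NCBI asks users to identify themselves

# Text-based search of the nucleotide database (placeholder query).
handle = Entrez.esearch(db="nucleotide", term='"Homo sapiens"[Organism] AND HBB[Gene Name]')
result = Entrez.read(handle)
handle.close()
ids = result["IdList"]
print(result["Count"], ids[:3])

# Follow Entrez cross-references from one nucleotide record to linked protein records.
if ids:
    handle = Entrez.elink(dbfrom="nucleotide", db="protein", id=ids[0])
    links = Entrez.read(handle)
    handle.close()
    print(links[0]["LinkSetDb"])   # linked protein IDs, if any
```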
▪ One of the databases accessible from Entrez is a biomedical literature database known as PubMed, which contains abstracts and full-text articles from nearly 4,000 journals. PubMed retrieves information based on Medical Subject Headings (MeSH) terms.
▪ The MeSH system consists of a collection of more than 20,000 controlled and standardized vocabulary terms used for indexing articles. It is a dictionary that helps convert search keywords into standardized terms describing a concept, allowing "smart" searches in which synonyms are used so that the user gets not only exact matches but also related matches.
▪ Another way to broaden the retrieval is by using the "Related Articles" option. PubMed uses a word-weight algorithm to identify related articles with similar words in the titles, abstracts, and MeSH terms.
▪ For complex searches, a user can use Boolean operators or a combination of the Limits and Preview/Index features. Field tags can be used to improve the efficiency of obtaining search results. The tags are identifiers for each field and are placed in brackets. For example, [AU] limits the search to the author name and [JID] to the journal. PubMed uses a list of such tags for literature searches.
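A minimal sketch of such a tagged Boolean query, again submitted through Biopython's Entrez interface (the author name and topic terms are placeholders; [TI] and [PT] are additional standard PubMed tags for title and publication type):

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"   # required identification for NCBI

# Boolean operators, parentheses, and field tags combined in one PubMed query.
# "[AU]" restricts a term to the author field, "[TI]" to the title,
# and "[PT]" to the publication type.
query = "(genome[TI] OR proteome[TI]) AND Smith J[AU] NOT review[PT]"

handle = Entrez.esearch(db="pubmed", term=query, retmax=5)
result = Entrez.read(handle)
handle.close()

print("Matches:", result["Count"])
for pmid in result["IdList"]:
    print("PMID:", pmid)
```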
▪ Another unique database accessible from Entrez is Online Mendelian Inheritance in Man (OMIM), a non-sequence-based database of human disease genes and human genetic disorders. Each entry in OMIM contains summary information about a particular disease as well as the genes related to the disease.
▪ NCBI also maintains a taxonomy database that contains the names and taxonomic positions of over 100,000 organisms with at least one nucleotide or protein sequence represented in the GenBank database.
▪ GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism. The content includes genomic DNA, mRNA, and raw sequence data; there is also a GenPept database for protein sequences, the majority of which are conceptual translations from DNA sequences.
▪ There are two ways to search for sequences in GenBank. One is using text-based keywords, similar to a PubMed search. The other is searching with a molecular sequence as the query, by sequence similarity using BLAST.
The sequence databases contain these elements:
• Accession number: a constant data identifier used to identify a sequence. It is a string of letters and/or numbers.
• Definition: a brief, one-line textual sequence description.
• Source and taxonomy information.
• Complete literature references.
• Comments and keywords.
• The important FEATURES table.
• A summary or checksum line.
• The sequence itself.
GenBank Sequence Format
 GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading. The resulting flat files contain three sections: Header, Features, and Sequence entry.
 The Header section describes the origin of the sequence, the identity of the organism, and the unique identifiers associated with the record. The top line of the Header section is the Locus, which contains a unique database identifier for the sequence's location in the database (not a chromosome locus), the sequence length, the molecule type (e.g., DNA or RNA), and a three-letter code for the GenBank division. There are 17 divisions in total; for example, PLN for plant, fungal, and algal sequences and BCT for bacterial sequences.
 The accession number for the sequence is a unique number assigned to a piece of DNA when it was first submitted to GenBank and is permanently associated with that sequence. This is the number that should be cited in publications; there is also a version number.
 The Features section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence. The "source" field provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number. The "gene" field gives information about the nucleotide coding sequence and its name.
 The third section of the flat file is the sequence itself, starting with the label "ORIGIN." The format of the sequence display can be changed by choosing options from a pull-down menu at the upper left corner. For DNA entries, there is a BASE COUNT report that gives the numbers of A, G, C, and T in the sequence. This section, for both DNA and protein sequences, ends with two forward slashes (the "//" symbol).
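As an illustration of these three sections, the sketch below parses a locally saved GenBank flat file with Biopython (assuming Biopython is installed; "example.gb" is a placeholder file name) and recomputes the BASE COUNT figures from the sequence:

```python
from Bio import SeqIO

# "example.gb" is a placeholder for any GenBank flat file saved locally,
# e.g., one downloaded from Entrez with rettype="gb".
for record in SeqIO.parse("example.gb", "genbank"):
    # Header information: LOCUS name, accession.version, definition, source.
    print("LOCUS:", record.name, "| ACCESSION.VERSION:", record.id)
    print("DEFINITION:", record.description)
    print("SOURCE:", record.annotations.get("organism"))
    print("TAXONOMY:", record.annotations.get("taxonomy"))

    # Features section: source, gene, and CDS entries with their qualifiers.
    for feature in record.features:
        if feature.type in ("source", "gene", "CDS"):
            print(feature.type, feature.location, dict(feature.qualifiers))

    # Sequence section (ORIGIN): recompute the BASE COUNT report.
    seq = record.seq
    print("BASE COUNT:", {base: seq.count(base) for base in "ACGT"})
```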
Alternative Sequence Formats
 FASTA is one of the simplest and the most popular sequence formats because it
contains plain sequence information that is readable by many bioinformatics
analysis programs. It has a single definition line that begins with a right-angle
bracket (>) followed by a sequence name. The plain sequence in standard one-letter
symbols starts in the second line. Each line of sequence data is limited to sixty to
eighty characters in width. The drawback of this format is that much annotation
information is lost.
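Because the format is so simple, a FASTA reader fits in a few lines of plain Python, as in the sketch below (the file name is a placeholder; Biopython's SeqIO.parse with the "fasta" format does the same job in one call):

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file.

    Each record starts with a '>' definition line; the sequence follows
    on one or more lines, conventionally 60 to 80 characters wide.
    """
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)

# Usage ("sequences.fasta" is a placeholder file name):
# for name, seq in read_fasta("sequences.fasta"):
#     print(name, len(seq))
```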
EMBL-EBI
 The Sequence Retrieval System (https://www.ebi.ac.uk/), maintained by the EBI, is a retrieval system comparable to NCBI Entrez. It is not as integrated as Entrez, but it allows the user to query multiple databases simultaneously and is another good example of database integration. It also offers direct access to certain sequence analysis applications, such as sequence similarity searching and Clustal sequence alignment.
Conversion of Sequence Formats
 Readseq, one of the most popular computer programs for sequence format conversion, is maintained by the EBI. It recognizes sequences in almost any format and writes a new file in an alternative format.
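Readseq is a standalone tool, so as a hedged alternative illustration of format conversion, the Biopython sketch below rewrites a GenBank flat file as FASTA (assuming Biopython is installed; both file names are placeholders):

```python
from Bio import SeqIO

# Convert a GenBank flat file to FASTA; returns the number of records written.
# As with any conversion to FASTA, most of the annotation is discarded.
count = SeqIO.convert("example.gb", "genbank", "example.fasta", "fasta")
print(f"Converted {count} record(s)")
```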
Algorithms and Complexity
Algorithms
• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem. We will specify problems in terms of their inputs and outputs, and the algorithm will be the method of translating the inputs into the outputs. A well-formulated problem is unambiguous and precise.
• To understand how an algorithm works, we need some way of listing the steps that the algorithm takes, while being neither too vague nor too formal. We will use pseudocode. Pseudocode is a language computer scientists often use to describe algorithms: it ignores many of the details that are required in a programming language.
• A problem describes a class of computational tasks. A problem instance is one particular input from that class.
Correct versus Incorrect Algorithms
•We say that an algorithm is correct when it can translate every input instance into
the correct output. An algorithm is incorrect when there is at least one input
instance for which the algorithm does not produce the correct output. So, unless
you can justify that an algorithm always returns correct results, you should
consider it to be wrong.
Fast versus Slow Algorithms
•You could estimate the running time of the algorithm simply by taking the
product of the number of operations and the time per operation. However,
computing devices are constantly improving, leading to a decreasing time per
operation, so your notion of the running time would soon be outdated. Rather than
computing an algorithm’s running time on every computer, we rely on the total
number of operations that the algorithm performs to describe its running time,
since this is an attribute of the algorithm, and not an attribute of the computer you
are using.
Unfortunately, determining how many operations an algorithm will perform is not always easy. If we know
how to compute the number of basic operations that an algorithm performs, then we have a basis to compare
it against a different algorithm that solves the same problem. Rather than tediously count every
multiplication and addition, we can perform this comparison by gaining a high-level understanding of the
growth of each algorithm’s operation count as the size of the input increases.
• To illustrate this, suppose algorithm "A" performs 11n³ operations on an input of size n, while algorithm "B" solves the same problem in 99n² + 7 operations. Which algorithm, A or B, is faster?
• Although A may be faster than B for some small n (e.g., for n between 0 and 9), B becomes faster for large n (e.g., for all n ≥ 10).
• Since n³ is, in some sense, a "faster-growing" function than n² with respect to n, the constants 11, 99, and 7 do not affect the competition between the two algorithms for large n. We refer to A as a cubic algorithm and to B as a quadratic algorithm and say that A is less efficient than B because it performs more operations to solve the same problem when n is large.
Big-O Notation
• Computer scientists use Big-O notation to describe concisely the running time of an algorithm. If we say that the running time of an algorithm is quadratic, or O(n²), it means that the running time of the algorithm on an input of size n is limited by a quadratic function of n. That limit may be 99.7n², 0.001n², or 5n² + 3.2n + 99993.
• The main factor that describes the growth rate of the running time is the term that grows the fastest with respect to n, for example n² when compared to terms like 3.2n or 99993. All functions with a leading term of n² have more or less the same rate of growth, so we lump them into one class, which we call O(n²).
• The difference in behavior between two quadratic functions in that class, say 99.7n² and 5n² + 3.2n + 99993, is negligible when compared to the difference in behavior between two functions in different classes, say 5n² + 3.2n and 1.2n³. Of course, 99.7n² and 5n² are different functions, and we would prefer an algorithm that takes 5n² operations to an algorithm that takes 99.7n². However, computer scientists typically ignore the leading constant and pay attention only to the fastest-growing term.
Algorithm Design Techniques
• We will illustrate the most common algorithm design techniques. Future examples can be
categorized in terms of these design techniques.
• To illustrate the design techniques, we will consider a very simple problem. Suppose your cordless
phone rings, but you have misplaced the handset somewhere in your home, and it is dark out, so you
cannot rely solely on eyesight.
How do you find it?
• Exhaustive Search (brute force algorithms)
▪ An exhaustive search, or brute force algorithm, examines every possible alternative to find one particular solution.
▪ For example, if you used the brute force algorithm to find the ringing telephone, you would ignore the ringing of the phone and simply walk over every square inch of your home checking to see if the phone was present. You probably would not be able to answer the phone before it stopped ringing, unless you were very lucky, but you would be guaranteed to eventually find the phone no matter where it was.
▪ These are the easiest algorithms to design and understand, and sometimes they work acceptably for certain practical problems in biology. In general, though, brute force algorithms are too slow to be practical for anything but the smallest instances, and we will try to avoid brute force algorithms or to finesse them into faster versions.
Branch-and-Bound Algorithms
▪ In certain cases, as we explore the various alternatives in a brute force algorithm, we discover that we can omit a large number of alternatives, a technique that is often called branch-and-bound, or pruning.
▪ Suppose you were exhaustively searching the first floor and heard the phone ringing above your head. You could immediately rule out the need to search the basement or the first floor. What may have taken three hours may now take only one, depending on the amount of space that you can rule out.
Greedy Algorithms
▪ Many algorithms are iterative procedures that choose among a number of alternatives at each iteration. For example, you can view the problem as a series of decisions you have to make. Some of these alternatives may lead to correct solutions while others may not. Greedy algorithms choose the "most attractive" alternative at each iteration. In general, this greedy strategy may produce incorrect results.
▪ In the telephone example, the corresponding greedy algorithm would simply be to walk in the direction of the telephone's ringing until you found it. The problem here is that there may be a wall (or an expensive and fragile vase) between you and the phone, preventing you from finding it. Unfortunately, these sorts of difficulties frequently occur in most realistic problems. In many cases, a greedy approach will seem "obvious" and natural but will be subtly wrong.
Dynamic Programming
▪ Some algorithms break a problem into smaller subproblems and use the solutions of the subproblems to construct the
solution of the larger one. During this process, the number of subproblems may become very large, and some
algorithms solve the same subproblem repeatedly, needlessly increasing the running time. Dynamic programming
organizes computations to avoid recomputing values that you already know, which can often save a great deal of
time.
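A generic illustration of this idea (not tied to any biological problem) is computing Fibonacci numbers: the naive recursion re-solves the same subproblems exponentially many times, while the dynamic programming version stores each answer once:

```python
def fib_naive(n):
    """Recomputes the same subproblems over and over."""
    if n <= 1:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n):
    """Dynamic programming: each subproblem is solved exactly once."""
    table = [0, 1]                     # solutions to the smallest subproblems
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]

print(fib_dp(40))    # 102334155, returned almost instantly
# fib_naive(40) gives the same answer but needs hundreds of millions of calls.
```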
Divide-and-Conquer Algorithms
• One big problem may be hard to solve, but two problems that are half the size may be significantly easier. Divide-and-conquer algorithms split the problem into smaller subproblems, solve the subproblems independently, and combine the solutions of the subproblems into a solution of the original problem. The situation is usually more complicated than this: after splitting one problem into subproblems, a divide-and-conquer algorithm usually splits these subproblems into even smaller sub-subproblems, and so on, until it reaches a point at which it no longer needs to recurse. A critical step in many divide-and-conquer algorithms is the recombining of solutions to subproblems into a solution for the larger problem. Often, this merging step can consume a considerable amount of time.
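A classic generic sketch of the pattern is merge sort: split the list, sort each half independently, then merge; the merge step is exactly the recombination cost mentioned above:

```python
def merge_sort(values):
    """Divide-and-conquer sorting: split, recurse, then merge."""
    if len(values) <= 1:               # base case: nothing left to split
        return values
    mid = len(values) // 2
    left = merge_sort(values[:mid])    # solve each half independently
    right = merge_sort(values[mid:])
    return merge(left, right)          # recombination step

def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([7, 2, 9, 4, 1]))   # [1, 2, 4, 7, 9]
```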
Machine Learning
• Machine learning algorithms often base their strategies on the computational analysis of previously collected data. To solve the phone search problem using this approach, you could collect statistics over the course of a year about where you leave the phone, learning where the phone tends to end up most of the time. If the phone was left in the living room 80% of the time, in the bedroom 15% of the time, and in the kitchen 5% of the time, then a sensible time-saving strategy would be to start the search in the living room, continue to the bedroom, and finish in the kitchen.
Randomized Algorithms
•If you have a coin, then before even starting to search for the phone, you
could toss it to decide whether you want to start your search on the first
floor if the coin comes up heads, or on the second floor if the coin comes
up tails. Although tossing coins may be a fun way to search for the
phone, it is certainly not the intuitive thing to do, nor is it at all clear
whether it gives you any algorithmic advantage over a deterministic
algorithm. Randomized algorithms help to solve practical problems, and some of them have a competitive advantage over deterministic algorithms.
Algorithms:
- Sequence Alignment
- DNA restriction mapping
- Motif Finding
Sequence Alignment
 As new biological sequences are generated at exponential rates, sequence comparison is becoming increasingly important for drawing functional and evolutionary inferences about a new protein by comparing it with proteins already existing in the database. The most fundamental process in this type of comparison is sequence alignment. Pairwise sequence alignment is the process of aligning two sequences and is the basis of database similarity searching and multiple sequence alignment.
 The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences. Regions that are aligned but not identical represent residue substitutions; regions where residues from one sequence correspond to nothing in the other represent insertions or deletions that have taken place in one of the sequences during evolution.
 When a sequence alignment reveals significant similarity among a group of sequences, they can be considered as belonging to the same family.
 If one member within the family has a known structure and function, then that information can be transferred to those that have not yet been experimentally characterized. Therefore, sequence alignment can be used as a basis for prediction of the structure and function of uncharacterized sequences.
 It is important to distinguish sequence homology from the related term sequence similarity because the two terms are often confused by some researchers. Sequence homology is a conclusion about a common ancestral relationship drawn from sequence similarity comparison when the two sequences share a high enough degree of similarity. On the other hand, similarity is a direct result of observation from the sequence alignment. Sequence similarity can be quantified using percentages; homology is a qualitative statement. For example, one may say that two sequences share 40% similarity. It is incorrect to say that the two sequences share 40% homology. They are either homologous or nonhomologous.
 In global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from beginning to end of both sequences to find the best possible alignment across the entire length of the two sequences. This method is more applicable for aligning two closely related sequences of roughly the same length. For divergent sequences and sequences of variable lengths, this method may not be able to generate optimal results.
 Local alignment does not assume that the two sequences have similarity over their entire length. It only finds local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequence regions. This approach can be used for aligning more divergent sequences with the goal of searching for conserved patterns in DNA or protein sequences. The two sequences to be aligned can be of different lengths. This approach is more appropriate for aligning divergent biological sequences containing only modules that are similar, which are referred to as domains or motifs.
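As a hedged sketch of both modes, the example below uses Biopython's Bio.Align.PairwiseAligner (assuming Biopython is installed; the two short sequences and the scoring values are arbitrary illustrations, not a recommended scoring scheme):

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.match_score = 1          # arbitrary illustrative scoring scheme
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

seq1 = "GATTACAGGT"              # toy sequences for illustration only
seq2 = "GCATTACTGT"

# Global alignment: both sequences are aligned over their entire length.
aligner.mode = "global"
global_aln = aligner.align(seq1, seq2)[0]
print("Global score:", global_aln.score)
print(global_aln)

# Local alignment: only the best-matching local region is reported.
aligner.mode = "local"
local_aln = aligner.align(seq1, seq2)[0]
print("Local score:", local_aln.score)
print(local_aln)
```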