Ch. 24 Bioinformatics
Bioinformatics is the application of computer science to the field of molecular biology.
Outline
1. Basics of bioinformatics
a. Definition
b. Data overload
c. Top goals for Bioinformatics
2. NCBI website
a. Entrez: Database Portal
b. Databases
i. Entrez Protein
ii. Entrez Gene
iii. OMIM
iv. Unigene
v. MeSH
vi. Pubmed
3. Other useful sites
a. European Bioinformatics Institute and Ensembl
b. ExPASy Sequence Retrieval System
3. Accession numbers & RefSeq
4. FASTA format
5. Pair-wise Sequence Alignment
a. BLAST
b. Similarity, identity, and conservation.
c. Homolog, ortholog, and paralog.
1.a Definition
Bioinformatics entails the creation and advancement of databases, algorithms, computational
and statistical techniques, and theory to solve formal and practical problems arising from the
management and analysis of biological data.
1.b Data Overload
Over the past few decades rapid developments in genomic and other molecular research
technologies and developments in information technologies have combined to produce a
tremendous amount of information related to molecular biology. Since the mid-nineties the
amount of published data has grown exponentially, creating a real problem of accessibility.
Making this information accessible through online databases became paramount for many
reasons, including avoiding repetition of experiments.
At the beginning of the "genomic revolution", bioinformatics was applied to the creation and
maintenance of databases that store biological information such as nucleotide and amino acid
sequences. Developing this type of database involved not only design issues but also the
development of complex interfaces through which researchers could both access existing data
and submit new or revised data. Figure 1.b.1 shows individual databases, each managing
data for a specific area of study. The important thing to note is that molecular biology is very
complex and many areas of study are interrelated. A key role of bioinformatics is to create a
usable interface that combines many of these databases on one easy-to-use platform. NCBI
provides such a platform in Entrez.
Figure 1.b.1
1.c Ten Problems facing Bioinformatics
A survey of bioinformatics such as this one only touches on the difficult and multifaceted
nature of this interface-driven field. Below is a list of open problems in a number of fields of
molecular biology that limit the overarching goals of bioinformatics.
[1] Precise models of where and when transcription will occur in a genome (initiation and
termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways
[4] Determining protein:DNA, protein:RNA, and protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein
function
[10] Education: development of bioinformatics curricula
2. NCBI – National Center for Biotechnology Information
URL: http://www.ncbi.nlm.nih.gov
The NCBI houses genome sequencing data in GenBank and an index of biomedical research
articles in PubMed Central and PubMed, as well as other information relevant to biotechnology.
All these databases are available online through the Entrez search engine.
The NCBI is part of the National Library of Medicine.
2.a Entrez: Database Portal
URL: http://www.ncbi.nlm.nih.gov/gquery/
Entrez is the integrated, text-based search and retrieval system used at NCBI for the major
databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete
Genomes, Taxonomy, and others. Figure 2.a.1 is a schematic diagram of the databases and
how many articles, sequences, etc. are contained in each.
Figure 2.a.1
The Entrez front page provides, by default, access to the global query. All databases indexed by
Entrez can be searched via a single query string, supporting boolean operators and search term
tags to limit parts of the search statement to particular fields. This returns a unified results page
that shows the number of hits for the search in each of the databases; each count is also a link to
the actual search results for that particular database.
Figure 2.a.2
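The query syntax described above (boolean operators plus bracketed field tags) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names tagged, combine, and esearch_url are hypothetical, and the URL is that of NCBI's E-utilities "esearch" service, which is not discussed in this chapter. The code builds the request URL but does not send it.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint (not named in the chapter text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez field tag, e.g. tagged("BRCA1", "Title") -> 'BRCA1[Title]'."""
    return f"{term}[{field}]"


def combine(operator, *terms):
    """Join terms with a boolean operator and wrap the result in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(database, query, max_results=20):
    """Build (but do not send) a search request URL for one Entrez database."""
    params = {"db": database, "term": query, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example: articles with BRCA1 in the title, indexed under either MeSH heading.
query = combine(
    "AND",
    tagged("BRCA1", "Title"),
    combine(
        "OR",
        tagged("breast neoplasms", "MeSH Terms"),
        tagged("ovarian neoplasms", "MeSH Terms"),
    ),
)
url = esearch_url("pubmed", query)
print(query)
print(url)
```

The same query string can be pasted directly into the Entrez search box; changing the database argument (for example to "protein" or "gene") targets a different Entrez database.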
Entrez also provides a similar interface for searching each particular database and for refining
search results. The Limits feature allows the user to narrow a search through a web forms interface. The
History feature gives a numbered list of recently performed queries. Results of previous queries
can be referred to by number and combined via boolean operators. Search results can be saved
temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely and also
choose to have updates with new search results e-mailed for saved queries of most databases.
Entrez is widely used in biotechnology research and teaching worldwide.
The following list of databases is by no means exhaustive; it is intended as a guide
to some of the more useful areas of Entrez.
2.a.i Entrez Protein
In this database you can search for particular proteins and retrieve amino acid sequence
information. The protein entries in this Entrez database have been compiled from a variety of
sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions
in GenBank and RefSeq.
2.a.ii Entrez Gene
In this database you can find information relating to genes. It does not include all known or
predicted genes; instead Entrez Gene focuses on the genomes that have been completely
sequenced, that have an active research community to contribute gene-specific information, or
that are scheduled for intense sequence analysis. The content of Entrez Gene represents the
result of curation and automated integration of data from NCBI's Reference Sequence project
(RefSeq), from collaborating model organism databases, and from many other databases
available from NCBI. Records are assigned unique, stable and tracked integers as identifiers.
The content (nomenclature, map location, gene products and their attributes, markers,
phenotypes, and links to citations, sequences, variation details, maps, expression, homologs,
protein domains and external databases) is updated as new information becomes available.
2.a.iii OMIM (Online Mendelian Inheritance in Man)
OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic
phenotypes. The full-text, referenced overviews in OMIM contain information on all known
mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between
phenotype and genotype. It is updated daily, and the entries contain copious links to other
genetics resources. Each entry is identified by a MIM number.
2.a.iv UniGene
UniGene is an NCBI database of the transcriptome and thus, despite the name, not primarily a
database for genes. Each entry is a set of transcripts that appear to stem from the same
transcription locus (i.e. gene or expressed pseudogene). The type of transcript cluster that
UniGene tries to catalog can be seen in Figure 2.a.3 below. Information on protein similarities, gene
expression, cDNA clones, and genomic location is included with each entry.
Figure 2.a.3
2.a.v MeSH
The MeSH Browser is an online vocabulary look-up aid available for use with MeSH (Medical
Subject Headings). It is designed to help quickly locate descriptors of possible interest and to
show the hierarchy in which descriptors of interest appear. Virtually complete MeSH records are
available, including the scope notes, annotations, entry vocabulary, history notes, allowable
qualifiers, etc. The browser does not link directly to any MEDLINE or other database retrieval
system and thus is not a substitute for the PubMed system.
2.a.vi PubMed
PubMed provides access to the MEDLINE database of citations, abstracts and some full
text articles on life sciences and biomedical topics. For comprehensive, optimal searching in
PubMed, it is necessary to have a thorough understanding of its core component, MEDLINE,
and especially of the MeSH controlled vocabulary used to index MEDLINE articles.
3. Other websites
3.a European Bioinformatics Institute and Ensembl
URL: http://uswest.ensembl.org/index.html
The Ensembl project was started in 1999, some years before the draft human genome was
completed. Even at that early stage it was clear that manual annotation of 3 billion base pairs of
sequence would not be able to offer researchers timely access to the latest data. The goal of
Ensembl was therefore to automatically annotate the genome, integrate this annotation with
other available biological data and make all this publicly available via the web. Since the
website's launch in July 2000, many more genomes have been added to Ensembl and the
range of available data has also expanded to include comparative genomics, variation and
regulatory data. Figure 3.a.1 below shows the complete genomes that can be found on
Ensembl.
Figure 3.a.1
3.b ExPASy Sequence Retrieval System
URL: http://www.expasy.org/
ExPASy (Expert Protein Analysis System) is a proteomics server of the Swiss Institute of
Bioinformatics (SIB) which analyzes protein sequences and structures and two-dimensional
polyacrylamide gel electrophoresis (2-D PAGE) data. The server functions in collaboration with the
European Bioinformatics Institute. ExPASy also produces the protein sequence knowledgebase,
UniProtKB/Swiss-Prot, and its computer-annotated supplement, UniProtKB/TrEMBL.
4 - Accession numbers and RefSeq
An accession number in bioinformatics is a unique identifier given to a DNA or protein sequence
record to allow for tracking of different versions of that sequence record and the associated
sequence over time in a single data repository. Because of their relative stability, accession
numbers can be utilized as foreign keys for referring to a sequence object, but not necessarily to
a unique sequence. All sequence information repositories implement the concept of "accession
number" but might do so with subtle variations.
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number
that corresponds to the most stable, agreed-upon “reference” version of a sequence.
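As a rough illustration of how an accession number carries version information, the sketch below splits an identifier of the common "accession.version" form into its two parts and looks up a few RefSeq prefixes. The function names parse_accession and refseq_type are hypothetical, the prefix table is deliberately partial, and the identifiers are used only as example strings; as noted above, repositories implement the concept with subtle variations that this pattern does not cover.

```python
import re

# Partial, illustrative table of RefSeq accession prefixes.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "NC_": "complete genomic molecule",
    "NG_": "genomic region",
    "XM_": "predicted mRNA (model)",
    "XP_": "predicted protein (model)",
}

# Assumed shape: letters (optionally with an underscore), digits, optional ".version".
ACCESSION_PATTERN = re.compile(r"^([A-Za-z_]+\d+)(?:\.(\d+))?$")


def parse_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version may be None."""
    match = ACCESSION_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognized accession format: {identifier!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


def refseq_type(accession):
    """Return the molecule type for a RefSeq-style prefix, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3])


for example in ["NM_000546.6", "NP_000537.3", "U49845"]:
    accession, version = parse_accession(example)
    print(accession, version, refseq_type(accession))
```

The accession part stays fixed while the version number increases each time the underlying sequence is revised, which is what makes the accession usable as a stable key.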
4 FASTA format
In bioinformatics, FASTA format (a.k.a. Pearson format) is a text-based format for representing
either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are
represented using single-letter codes. The format also allows for sequence names and
comments to precede the sequences.
A sequence in FASTA format is represented as a series of lines, which should be no longer than
120 characters and usually do not exceed 80 characters. This was probably to allow
for preallocation of fixed line sizes in software: at the time, most users relied on DEC VT (or
compatible) terminals which could display 80 or 132 characters per line. Most people
preferred the bigger font of the 80-character mode, and so it became the recommended
fashion to use 80 characters or less (often 70) in FASTA lines.
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a greater-than
(">") symbol in the first column. The word following the ">" symbol is the identifier of the
sequence, and the rest of the line is the description (both are optional). There should be no
space between the ">" and the first letter of the identifier. It is recommended that all lines of text
be shorter than 80 characters. The sequence ends if another line starting with a ">" appears;
this indicates the start of another sequence. Figure 4.1 shows a typical FASTA-format
sequence.
Figure 4.1
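The rules above (a ">" description line, an identifier followed by an optional description, then one or more lines of sequence data) translate fairly directly into a small reader. The following is a minimal sketch, not a full implementation: parse_fasta is a hypothetical name, the two records are made-up examples, and details such as comment lines or line-length checks are ignored.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples from FASTA text."""
    records = []
    identifier = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                records.append((identifier, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        records.append((identifier, description, "".join(chunks)))
    return records


example = """>seq1 hypothetical example peptide
MKTAYIAKQR
QISFVKSHFS
>seq2
ACGTACGT
"""

records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, "|", description, "|", sequence)
```

Sequence lines belonging to one record are simply concatenated, so the line length used in the file does not affect the parsed sequence.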
5 Pair-wise sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or
protein to identify regions of similarity that may be a consequence of functional, structural, or
evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino
acid residues are typically represented as rows within a matrix. Gaps are inserted between the
residues so that identical or similar characters are aligned in successive columns.
Pair-wise sequence alignment methods are used to find the best-matching piecewise (local) or
global alignments of two query sequences. Pair-wise alignments can only be used between two
sequences at a time, but they are efficient to calculate and are often used for methods that do
not require extreme precision (such as searching a database for sequences with high similarity
to a query). The three primary methods of producing pair-wise alignments are dot-matrix
methods, dynamic programming, and word methods; however, multiple sequence alignment
techniques can also align pairs of sequences. Although each method has its individual strengths
and weaknesses, all three pair-wise methods have difficulty with highly repetitive sequences of
low information content, especially where the number of repetitions differs in the two sequences
to be aligned. One way of quantifying the utility of a given pair-wise alignment is the 'maximum
unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM
sequences typically reflect closer relatedness.
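To make the dynamic programming method more concrete, here is a minimal sketch of a global alignment score in the style of the Needleman-Wunsch algorithm (the algorithm name and the scoring values are assumptions chosen for illustration; the chapter does not specify them). It uses a simple scheme of +1 for a match, -1 for a mismatch, and -1 per gap position, and returns only the best score rather than the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b under a linear gap penalty."""
    # Row 0: aligning the empty prefix of a against each prefix of b costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + pair_score  # align a[i-1] with b[j-1]
            up = previous[j] + gap                   # a[i-1] opposite a gap
            left = current[j - 1] + gap              # b[j-1] opposite a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the scoring matrix are kept in memory at a time, which is enough when just the score is needed; recovering the alignment itself would require keeping the full matrix for a traceback.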
5.1 BLAST
A BLAST search enables a researcher to compare a query sequence with a library or database
of sequences, and identify library sequences that resemble the query sequence above a certain
threshold. For example, following the discovery of a previously unknown gene in the mouse, a
scientist will typically perform a BLAST search of the human genome to see if humans carry a
similar gene; BLAST will identify sequences in the human genome that resemble the mouse
gene based on similarity of sequence.
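The idea of comparing a query against a library and keeping only sequences that resemble it above a threshold can be sketched with a toy "word method". This is not the BLAST algorithm itself, only a simplified illustration: the function names, the word length k, the threshold, and the library sequences are all hypothetical, and the score is just the fraction of the query's short words that also occur in each library sequence.

```python
def words(sequence, k):
    """Return the set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def search_library(query, library, k=3, threshold=0.5):
    """Return (name, fraction) for library sequences sharing enough words with the query."""
    query_words = words(query, k)
    if not query_words:
        return []  # query shorter than k: nothing to compare
    hits = []
    for name, sequence in library.items():
        shared = len(query_words & words(sequence, k))
        fraction = shared / len(query_words)
        if fraction >= threshold:
            hits.append((name, fraction))
    # Best-matching sequences first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up library for illustration only.
library = {
    "human_like": "ATGGCGTACGTTAGC",
    "partial": "ATGGCGTAAAAAAAA",
    "unrelated": "TTTTTTTTTTTTTTT",
}
hits = search_library("ATGGCGTACGTT", library)
for name, fraction in hits:
    print(name, round(fraction, 2))
```

Real search tools go well beyond this: they extend word hits into alignments and attach statistical scores to each hit, which is what the scoring data in the results table reports.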
Input – FASTA format sequence
Output – When performing a BLAST search on NCBI, the results are given in a graphical format showing
the hits found, a table showing sequence identifiers for the hits with scoring-related data, as well
as alignments for the sequence of interest and the hits received with corresponding BLAST
scores for these. The easiest to read and most informative of these is probably the table. Figure
5.1.1 shows the different types of basic BLAST searches that NCBI offers. There are also
specialized BLAST searches that use more finely tuned algorithms.
Figure 5.1.1
To run a BLAST sequence alignment query you have to enter two FASTA sequences, set gap
parameters, set the penalty for a mismatch, and, most importantly, choose the correct algorithm
for the search. Depending on the size of the sequences or the required accuracy of the
comparison, there is a series of algorithms to choose from. The NCBI website offers quick
tips on which BLAST search to run. Figure 5.1.2 shows the typical BLAST page on the NCBI
website.
Figure 5.1.2
Figure 5.1.3 below is a typical image produced by a BLAST sequence
analysis. The first line is the first sequence entered and the second line is the second
sequence. Wherever there is an exact match between two elements in the strings, a line is
drawn between them. Where there is not an exact match, a system of one and
two dots signifies how closely the replacement amino acid or nucleotide matches the
physicochemical profile of the original. One dot between the pair signifies a somewhat similar
profile, while two dots represent a very similar profile.
Figure 5.1.3
The dot-based system of similarity, as mentioned before, is based on how closely the substituted
residue matches the original physicochemical profile. This is what is called conservation.
Similarity is therefore based on both the extent to which the two sequences are identical and the
conservation of physicochemical properties.
Figure 5.1.4
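The distinction between identity and similarity can be illustrated with a short calculation on an already aligned pair of sequences. The sketch below is a simplification under stated assumptions: the amino acid groups are a hypothetical, rough grouping by physicochemical property (real tools use substitution matrices), only two tiers are marked ("|" for identical and ":" for similar, omitting the single-dot tier), and percentages are taken over the full alignment length including gap columns.

```python
# Hypothetical grouping of amino acids by broadly similar physicochemical properties.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("STNQ"),  # polar, uncharged
]


def are_similar(x, y):
    """True if both residues fall in the same illustrative property group."""
    return any(x in group and y in group for group in SIMILAR_GROUPS)


def compare_aligned(first, second):
    """Return (marker_line, percent_identity, percent_similarity) for an aligned pair."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have equal length")
    if not first:
        return "", 0.0, 0.0
    markers = []
    identical = similar = 0
    for x, y in zip(first, second):
        if x == "-" or y == "-":
            markers.append(" ")  # gap column
        elif x == y:
            markers.append("|")  # exact match
            identical += 1
        elif are_similar(x, y):
            markers.append(":")  # conservative substitution
            similar += 1
        else:
            markers.append(" ")  # non-conservative substitution
    length = len(first)
    identity = 100.0 * identical / length
    similarity = 100.0 * (identical + similar) / length
    return "".join(markers), identity, similarity


first, second = "MKV-LDE", "MRVALEQ"
marker_line, identity, similarity = compare_aligned(first, second)
print(first)
print(marker_line)
print(second)
print(round(identity, 1), round(similarity, 1))
```

Similarity is always at least as high as identity, because every identical position also counts as similar.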
5.2 Homologs, Paralogs, and Orthologs
The relative similarity between the sequence of a certain protein or gene in one animal and
that in another is very important information. Similarity between two sequences can come about
through various means. The similarity could come from the fact that the sequences
are homologs, in that they share a common ancestor. More specifically, the two sequences may
be orthologs: homologous sequences in different species that arose from a common ancestral
gene during speciation. In this case, similarity in sequence does not necessitate similarity in
function. Finally, the two sequences may be paralogs: homologous sequences within one
species that arose from random gene duplication. Figure 5.2.1 below shows schematically how
these three origins of similarity are related.
Figure 5.2.1