What is bioinformatics?

(Adapted from the Frequently Asked
Questions page at bioinformatics.org)
In a broad sense bioinformatics describes
any use of computers to handle biological
information.
In practice, the definition used by most
people is narrower; bioinformatics is a
synonym for "computational molecular
biology"---the use of computers to
characterize the molecular components of living things.
"Classical" bioinformatics
When most biologists talk about bioinformatics they are referring to
the practice of using computers to store, compare, retrieve, analyze or
predict the function of biomolecules. Biomolecules include your genetic
material (DNA and RNA) and the products of your genes: proteins.
These are the concerns of "classical" bioinformatics, dealing primarily
with sequence analysis.
Fredj Tekaia at the Institut Pasteur offers this definition of
bioinformatics:
The mathematical, statistical and computing methods that
aim to solve biological problems using DNA and amino acid
sequences and related information.
Most large biological molecules are polymers, or ordered chains of
simpler molecular modules called monomers. Think of the monomers
as beads or building blocks which, despite having different colors and
shapes, all have the same thickness and the same way of connecting
to one another.
Monomers that can combine in a chain are of the same general class,
but each kind of monomer in that class has its own well-defined set of
characteristics. Many monomer molecules can be joined together to
form a single, far larger, macromolecule. Macromolecules can have
exquisitely specific informational content and/or chemical properties.
According to this scheme, the monomers in a given macromolecule of
DNA or protein can be treated computationally as letters of an
alphabet, put together in pre-programmed arrangements to carry
messages or do work in a cell. Bioinformatics uses computational
methods to try and figure out the information contained in these
strings of letters.
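As a small illustration of treating a macromolecule as a string of
letters, the sketch below counts the bases in a short DNA string and
builds its reverse complement. The sequence is made up for the example,
and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical example sequence; not taken from any real gene.
dna = "ATGGCGTACGCTTAG"

# Count how often each monomer (base) letter occurs in the string.
base_counts = Counter(dna)

# Build the reverse complement: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
reverse_complement = "".join(COMPLEMENT[base] for base in reversed(dna))

# Fraction of G and C letters, a simple summary statistic of a sequence.
gc_fraction = (base_counts["G"] + base_counts["C"]) / len(dna)

print(base_counts)
print(reverse_complement)
print(round(gc_fraction, 3))
```

Nothing here is specific to DNA: a protein could be handled the same
way with a 20-letter amino acid alphabet in place of the four bases.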
"New" bioinformatics
The greatest achievement of bioinformatics to date, the Human
Genome Project, is now completed. While sequencing of various
genomes is still ongoing, the nature and priorities of bioinformatics
research and applications are changing to focus on deciphering what
the information contained in these genomes tells us. Here are some of
the ways in which people are attacking this problem:
- Now that we have sequenced multiple whole genomes we can
  look for differences and similarities between all the genes of
  multiple species. From such studies we can draw particular
  conclusions about species and general ones about evolution. This
  kind of science is often referred to as comparative genomics.
- There are now technologies designed to measure the relative
  number of copies of a genetic message (levels of gene
  expression) at different stages in development or disease or in
  different tissues. These technologies, DNA microarrays among
  them, are often referred to collectively as functional genomics,
  and they produce large amounts of data that must be stored and
  analyzed using computational methods.
- Large-scale methods of investigating the functions and
  associations of proteins (for example yeast two-hybrid methods)
  are frequently referred to as proteomics (a combination of
  protein and genomics).
- Medical informatics is the more direct application of genomic
  and proteomic technologies to the investigation of disease, and
  usually includes the incorporation of traditional clinical data. The
  merging of individual clinical data with newer molecular data,
  while providing exciting opportunities for finding the causes of
  complex diseases like diabetes, heart disease, or cancer, also
  poses some unique problems in data management,
  integration, and analysis.
How old is the discipline?
"How old is bioinformatics?" The answer to this depends on which
source you choose to read. Bioinformatics is a relatively young field
compared to physics, chemistry, biochemistry, molecular biology, or
computer science, but people have been engaging in some of its
practices since the dawn of molecular biology 40 years ago.
From Attwood and Parry-Smith's "Introduction to Bioinformatics"1,
The term bioinformatics is used to encompass almost all
computer applications in biological sciences, but was
originally coined in the mid-1980s for the analysis of
biological sequence data.
From Mark S. Boguski's article in the "Trends Guide to
Bioinformatics"2,
The term "bioinformatics" is a relatively recent invention,
not appearing in the literature until 1991 and then only in
the context of the emergence of electronic publishing...
...However, some of my role models when I was a
graduate student (Margaret O. Dayhoff, Russell F.
Doolittle, Walter M. Fitch and Andrew D. McLachlan) had
been building databases, developing algorithms and
making biological discoveries by sequence analysis since
the 1960s---long before anyone thought to label this
activity with a special term (if anything it was called
‘molecular evolution’). Even a relatively new kid on the
block, the National Center for Biotechnology Information
(NCBI), is celebrating its 10th anniversary this year,
having been written into existence by US Congressman
Claude Pepper and President Ronald Reagan in 1988. So
bioinformatics has, in fact, been in existence for more than
30 years and is now middle-aged.
1 Prentice-Hall 1999 (Longman Higher Education; ISBN 0582327881)
2 Elsevier, Trends Supplement 1998, p. 1.
Other Fields Related to Bioinformatics
Biophysics
Molecular biology itself grew out of biophysics. The British Biophysical
Society defines biophysics as "an interdisciplinary field which applies
techniques from the physical sciences to understanding biological
structure and function".
Computational Biology
Computational Biology is a broader term than bioinformatics, and
encompasses various computational approaches to modeling biological
problems ranging in scope from ecosystems, to blood flow in the heart,
to a cell, to the dynamics of individual protein molecules.
Genomics
Genomics is the intersection of genetics and bioinformatics, and
involves the analysis or comparison of genomes or subsets of
genomes.
Mathematical Biology
Mathematical biology is less tied to the collection and analysis of
sequence data than bioinformatics, and generally entails developing
mathematical models (or applying existing models) to explain various
features of biological systems.
Proteomics
Proteomics involves characterizing the many
tens of thousands of proteins expressed in a
given cell type at a given time (e.g.
measuring their biochemical properties,
identifying what other proteins and smaller
molecules they interact with, determining the
spatial location within the cell where they are
found, or determining their three-dimensional
structures) and involves the storage and
analysis of vast amounts of data.
Medical Informatics
Medical informatics traditionally deals with the collection and
management of patient data in a health care setting. Biomedical
informatics involves the combination of this traditional patient data
with newer molecular data to investigate questions of disease at the
patient level.
Where to go for more information
The information in this document was adapted from:
http://bioinformatics.org/faq/
Wikipedia has a fairly comprehensive page on Bioinformatics:
http://en.wikipedia.org/wiki/Bioinformatics
The European Bioinformatics Institute (EBI) has an extensive
information and tutorial section on their website titled “2can”:
http://www.ebi.ac.uk/2can/bioinformatics/bioinf_what_1.html
The National Center for Biotechnology Information (part of the
National Institutes of Health, and the home of the BLAST sequence
alignment program) has an education site:
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
The Howard Hughes Medical Institute (HHMI) has a variety of
educational materials, both online and available for free
on DVD, relating to modern biomedical science at:
http://www.hhmi.org/biointeractive/index.html
Careers in Bioinformatics
http://www.biohealthmatics.com/careers/biocareerpaths.aspx
Bioinformatician
This role involves acquiring and analyzing data from collaborators,
public databases, genome projects, and scientific publications.
Biomedical Computer Scientist
The role involves the design and development of programs and/or
databases to be used in the biological field. Strong programming
skills are usually a requirement for this role.
Geneticist
These usually fall into three categories – Research Geneticists,
Laboratory Geneticists and Genetic Counselors. The first two usually
require bioinformatics degrees at graduate level, and all require a
strong understanding of genetics.
Computational Biologist
Computational Biologists develop computational tools and methods to
solve complex theoretical and mathematical problems as they relate to
interpreting genomic information. The role calls for knowledge of how
to correlate statistical and mathematical analyses with genetic and
biological information. A computational biologist would have to
collaborate with other researchers and departments, and some of his or
her duties would include developing tools that support research
objectives, compiling and analyzing data, and writing and editing
reports for journal publication as needed.
Biostatistician
A biostatistician's responsibilities include reviewing potential
bioinformatics publications for statistical accuracy, and writing reports
for in-house team members and collaborators, as well as gathering
information on various studies. At a higher level they would ensure the
consistent application of statistical analysis across different studies.
Experience with a wide range of statistical methods, such as ANOVA,
logistic regression analysis, survival analysis, linkage analysis, and
multivariate analysis would be necessary. Proficiency in S-PLUS or SAS
(computer-based statistical analysis tools) might be necessary in some
positions.
Biomedical Chemist
Biomedical Chemists analyze pharmaceutical materials for quality,
purity and strength. They use approved methodology and observe
safety practices. They produce sample batches of a drug for troubleshooting and help design the scaling-up process that takes drug
manufacture up to factory proportions. Advanced positions require
extensive record keeping and the supervision and integration of a lab
team.
Clinical Data Manager
The Clinical Data Manager uses complex computer systems within
bioinformatics environments and needs analytical skill to
detect and resolve data problems in clinical research studies. A good
understanding of the data generated in a clinical research study,
methodologies for data storage, reviewing data, database design and
testing, and the ability to extract information are all skills that a CDM
must have.
Molecular Microbiologist
The Microbiologist will support efforts to characterize pathogenic
bacteria. The Microbiologist will determine bacterial/spore resistance
to standard and novel antimicrobials and decontaminants under various
conditions. A Bachelor of Science in Microbiology with experience in
general microbiology is sometimes required.
Software/Database Programmer
The bioinformatics programmer is responsible for performing analyses
on data from genomic and other biological databases, clinical trials and
other sources, including listings, tabulations, graphical summaries and
formal statistical estimates and tests. The role requires the ability to
assess the quality of analysis data, perform cross-study analyses, and
write and use SAS macros to automate all of the above functions.
Additionally, the person in this role will design and create analysis
databases. A thorough knowledge of study design and protocol
requirements is fundamental.
Medical Writer/Technical Writer
The duties of the medical writer comprise assisting departments in the
preparation and writing of documents required for regulatory
submissions, writing study protocols and other documents needed for
clinical studies, and preparing clinical study reports in accordance with
regulatory guidelines. Other tasks include drafting and coordinating the
preparation of manuscripts for publication. A Master’s or PhD degree is
usually required for this position.
Research Associates and Research Scientists
An advanced degree is needed. Research Associates participate in and
contribute to a scientific objective. They must be conversant with
laboratory equipment and software use as well as safety procedures and
protocols. Research Associates also monitor and collect clinical trial
data; coordinate designed trials; and prepare written reports, protocols,
and study tracking documents. They are responsible for overall site
management, including conducting initiation, interim, and close-out
visits. A high level of interaction is required with physicians and
pharmaceutical companies. Reviewing study documentation and
ensuring compliance with clinical objectives and procedures is also a
requirement for this role.
How does BLAST work?
Overview
BLAST stands for Basic Local Alignment Search Tool. It takes as input
a sequence (either DNA or protein), and returns a list of sequences
from the database that are ranked according to how similar they are to
the input sequence. The underlying premise for this technique is that
sequences that are similar to the input, or query, sequence are
homologous to it; that is, they are derived from a common precursor
sequence. Over time the two sequences have accumulated mutations
and are no longer identical, but we can calculate statistically the
likelihood that they are as similar as they are due to pure chance.
Once the sequences diverge sufficiently from each other, we are no
longer able to tell with certainty whether they are homologous or not –
they are no more similar than sequences from genes that are not
derived from the same common precursor molecule (i.e. unrelated
sequences). This is why the ranking according to similarity is
important – we want to find homologous sequences, because then we
can assume that the genes we find in the database have similar
biological functions.
The BLAST page for doing a nucleotide database search (blastn).
E-values
Each database sequence (or database “hit”) has an E-value listed after
it. The “E” stands for “expectation”. Strictly speaking, this number is
not the likelihood that the database and query sequences are
homologous. Instead, it tells us how many times we should “expect”
(hence the name, “expectation value”) to see that amount of similarity
between two sequences simply by chance (i.e. by searching a database
of the same size made up of randomly generated sequences). Smaller
E-values therefore indicate more significant hits, which is sort of
backwards from what you might expect. An E-value of 1 means we would
expect to see that level of similarity about once, on average, each time
we search the database with our query sequence. An E-value of 0.1 (or
10^-1 in scientific notation) means we would expect to see that level of
similarity about once in every 10 times we searched the database.
Scientists often use an E-value of 0.001 (or 10^-3) as a cutoff – if the
E-value is smaller than this then the “hit” is assumed to be a
homologous gene (since a match that good would turn up by chance only
about once in a thousand searches), while if it is larger than this then
we assume that we cannot tell for sure whether the “hit” is homologous
or not – the two sequences are not sufficiently similar for us to make
that determination.
These sequences in the database can be considered homologs of the query
sequence, since there is only 1 chance in 3 x 10^28 that the level of similarity
between the query and database sequences is due to chance.
These sequences cannot be assumed to be homologs to the query sequence,
since there is roughly a 1 out of 1 chance that we would see this level of
similarity between two sequences simply by chance.
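The cutoff rule described above can be sketched in a few lines. The hit
names and E-values below are invented for illustration, and the function
names are hypothetical. The conversion from an E-value to a probability
assumes that chance hits follow a Poisson distribution with mean E,
which is the usual statistical model but is not stated in the text above.

```python
import math

def prob_at_least_one_chance_hit(e_value: float) -> float:
    """Probability of seeing one or more chance hits this good.

    Assumes the number of chance hits is Poisson-distributed with
    mean equal to the E-value, so P = 1 - exp(-E).
    """
    return -math.expm1(-e_value)

def passes_cutoff(e_value: float, cutoff: float = 1e-3) -> bool:
    """Apply the 0.001 rule of thumb: smaller E-values pass."""
    return e_value < cutoff

# Hypothetical hits: (name, E-value). Not real BLAST output.
hits = [("hit_A", 3e-28), ("hit_B", 5e-4), ("hit_C", 1.0)]

for name, e_value in hits:
    p = prob_at_least_one_chance_hit(e_value)
    verdict = "likely homolog" if passes_cutoff(e_value) else "cannot tell"
    print(f"{name}: E={e_value:g}, P(chance)={p:.3g}, {verdict}")
```

For very small E-values the probability and the E-value are nearly
equal, which is why the two are often spoken of interchangeably; they
differ noticeably only once E approaches 1.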
Alignments
The way BLAST comes up with the similarity measure between two
sequences is to align them. It does this by going through a process of
matching up the appropriate letters (A, G, T, and C, representing the 4
bases, or building blocks, of DNA), maximizing the number of
alignments between like letters (e.g. an A aligned to an A), and
minimizing the number of mismatches (e.g. an A aligned to a G). It
does this by giving high scores to matches and negative scores to
mismatches, then adding up the score for the whole alignment. You
can view the alignment between the query sequence and any of the
top-scoring database sequences in the BLAST output.
A BLAST alignment.
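The scoring idea described above (positive scores for matches, negative
scores for mismatches, summed over the alignment) can be sketched as
follows. This is only a minimal illustration, not BLAST's actual search
procedure: the function name is hypothetical, the match and mismatch
values are chosen for the example rather than taken from any BLAST
setting, and the two sequences are assumed to be already lined up
position by position with no gaps.

```python
def score_ungapped(query: str, subject: str,
                   match: int = 2, mismatch: int = -3) -> int:
    """Sum per-position scores for two pre-aligned, equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # Each aligned pair of letters contributes a reward or a penalty.
    return sum(match if q == s else mismatch
               for q, s in zip(query, subject))

# Made-up sequences: 5 matching positions and 2 mismatches.
print(score_ungapped("GATTACA", "GACTATA"))
# An identical pair scores the maximum for its length.
print(score_ungapped("GATTACA", "GATTACA"))
```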
Gaps
One of the ways that sequences diverge from each other over time is
through the addition or deletion of nucleotides (i.e. the letters) in one
sequence or the other. This is accounted for by putting gaps in the
alignment – these are represented in the BLAST output by aligning a
letter in one sequence with a dash (-) in the other sequence. These
are also given negative scores, just as with mismatched letters.
A BLAST alignment of protein sequences with gaps (lower right-hand
corner). Note that gaps (shown as dashes) can appear in either the query
sequence or the database sequence.
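Gap penalties fit into the same kind of column-by-column scoring. The
sketch below scores two already-aligned strings in which a dash marks a
gap. It assumes a single fixed penalty per gap position, which is a
simplification (alignment programs commonly charge differently for
opening a gap and for extending it), and the function name and all
score values are illustrative only.

```python
def score_gapped(aligned_query: str, aligned_subject: str,
                 match: int = 2, mismatch: int = -3, gap: int = -4) -> int:
    """Score two pre-aligned strings where '-' marks a gap position."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" and s == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if q == "-" or s == "-":
            score += gap        # letter aligned with a dash
        elif q == s:
            score += match      # like letters aligned
        else:
            score += mismatch   # unlike letters aligned
    return score

# Made-up alignment: one gap in each sequence, 6 matching columns.
print(score_gapped("GATT-ACA", "GA-TTACA"))
```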
Organisms
One of the useful things about the BLAST site is that one can quickly
find out which organism a sequence belongs to. This can be done for
single sequences by clicking on the sequence description itself, which
takes you to a page giving information about that particular sequence,
or for all the hits at once by clicking on the “Taxonomy Reports” link.
Sometimes it may take a bit of sleuthing to figure out what the
scientific name means in everyday language (e.g. that Arabidopsis
thaliana is a plant and Drosophila melanogaster is a fruit fly), but this
is very useful information to have when evaluating the list of hits to
your query sequence.
Click on the “Taxonomy Reports” link to see more detailed information
about the organisms that the sequences found by BLAST come from.