BioInformatics at FSU - Department of Biological Science

BioInformatics at FSU
– what it is, who’s doing
it, and why it needs to be
done now.
Steve Thompson
Florida State University School of
Computational Science and
Information Technology (CSIT)
Introductory outline:
What is bioinformatics, genomics,
sequence analysis, computational
molecular biology . . .
Reverse Biochemistry & Evolution.
Database growth & cpu power.
A very brief ‘Show and Tell,’
NCBI Resources, GCG’s SeqLab,
phylogenetics.
High quality training is essential!
Graduates need to be competitive on a
My definitions:
Biocomputing and computational biology are synonymous
and describe the use of computers and computational
techniques to analyze any biological system, from
molecules, through cells, tissues, and organisms, all
the way to populations.
Bioinformatics describes using computational techniques
to access, analyze, and interpret the biological
information in any of the available biological
databases.
Sequence analysis is the study of molecular sequence
data for the purpose of inferring the function,
mechanism, interactions, evolution, and perhaps
structure of biological molecules.
Genomics analyzes the context of genes or complete
genomes (the total DNA content of an organism) within
and across genomes.
Proteomics is the subdivision of genomics concerned with
analyzing the complete protein complement, i.e. the
proteome, of organisms, both within and between
The reverse biochemistry analogy:
from a ‘virtual’ DNA sequence to actual
molecular physical characterization, not the
other way ‘round.
Using bioinformatics tools, you can infer all
sorts of functional, evolutionary, and
structural insights into a gene product,
without the need to isolate and purify
massive amounts of protein! Eventually
you can go on to clone and express the
gene based on that analysis using PCR
techniques.
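As a minimal sketch of the first step of this ‘virtual’ biochemistry, the following Python translates a DNA coding sequence into its predicted protein using the standard genetic code. The function name `translate` and the example sequence are hypothetical illustrations, not part of any package named in these slides, and real analyses would also have to consider reading frames, introns, and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence in frame 1, stopping at the first stop codon.

    Codons that are not in the table (for example, ones containing
    ambiguity codes) are reported as 'X'.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[start:start + 3], "X")
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical 12-base example: ATG GCC AAA TAA
peptide = translate("ATGGCCAAATAA")
print(peptide)  # MAK
```

The predicted peptide is only the starting point; the functional, evolutionary, and structural inferences described above come from comparing it against the databases.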
The computer and molecular databases
The exponential growth of molecular
sequence databases & cpu power.
Year   BasePairs      Sequences
1982   680338         606
1983   2274029        2427
1984   3368765        4175
1985   5204420        5700
1986   9615371        9978
1987   15514776       14584
1988   23800000       20579
1989   34762585       28791
1990   49179285       39533
1991   71947426       55627
1992   101008486      78608
1993   157152442      143492
1994   217102462      215273
1995   384939485      555694
1996   651972984      1021211
1997   1160300687     1765847
1998   2008761784     2837897
1999   3841163011     4864570
2000   11101066288    10106023
2001   15849921438    14976310
2002   28507990166    22318883
Doubling time ~ 1 year!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
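The doubling time can be estimated directly from the base-pair counts in the GenBank table. The sketch below fits a straight line to log2(base pairs) against year by ordinary least squares; the fitting method and the name `doubling_time_years` are my assumptions for illustration, and the counts are copied from the table. Averaged over the whole 1982 to 2002 span, the estimate comes out somewhat above one year, with faster doubling in the most recent years.

```python
import math

# Base pairs in GenBank by year, copied from the table in these slides.
GENBANK_BASE_PAIRS = {
    1982: 680338, 1983: 2274029, 1984: 3368765, 1985: 5204420,
    1986: 9615371, 1987: 15514776, 1988: 23800000, 1989: 34762585,
    1990: 49179285, 1991: 71947426, 1992: 101008486, 1993: 157152442,
    1994: 217102462, 1995: 384939485, 1996: 651972984, 1997: 1160300687,
    1998: 2008761784, 1999: 3841163011, 2000: 11101066288,
    2001: 15849921438, 2002: 28507990166,
}


def doubling_time_years(counts_by_year):
    """Estimate doubling time from a least-squares fit of log2(count) vs. year."""
    years = sorted(counts_by_year)
    logs = [math.log2(counts_by_year[year]) for year in years]
    mean_year = sum(years) / len(years)
    mean_log = sum(logs) / len(logs)
    slope = (
        sum((y - mean_year) * (v - mean_log) for y, v in zip(years, logs))
        / sum((y - mean_year) ** 2 for y in years)
    )  # doublings per year
    return 1.0 / slope


overall = doubling_time_years(GENBANK_BASE_PAIRS)
recent = doubling_time_years(
    {year: n for year, n in GENBANK_BASE_PAIRS.items() if year >= 1995}
)
print(f"1982-2002: {overall:.2f} years per doubling")
print(f"1995-2002: {recent:.2f} years per doubling")
```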
Database growth (cont.)
The Human Genome Project and numerous other genome projects
have kept the data coming at alarming rates. As of April 2003
(50 years after the Watson-Crick double helix!), 16 Archaea, 128
Bacteria, and 10 Eukaryote complete, finished genomes, plus 4
Vertebrate and 5 Plant essentially complete genome maps, are
publicly available for analysis, not counting all the virus and
viroid genomes available.
The International Human Genome Sequencing Consortium
announced the completion of a "Working Draft" of the human
genome in June 2000; independently that same month, the
private company Celera Genomics announced that it had
completed the first assembly of the human genome. Both
articles were published mid-February 2001 in the journals
Some neat stuff from those papers:
We, Homo sapiens, aren’t nearly as special as
we had once hoped we were. Of the 3.2 billion
base pairs in our DNA —
Traditional, text-book estimates of the number of
genes were often in the 100,000 range; turns out
we’ve only got about twice as many as a fruit fly,
between 25,000 and 35,000!
The protein coding region of our genome is only about
1% or so; much of the remaining ‘junk’ is ‘jumping,’
‘selfish’ DNA, much of which may be involved in
regulation and control. Understanding this network
is a huge challenge.
100-200 genes were transferred from an ancestral
bacterial genome to an ancestral vertebrate
genome! (Later shown to be not true by more extensive
analyses, and to be due to gene loss rather than transfer.)
What are primary sequences?
(Central Dogma: DNA —> RNA —> protein)
Primary refers to one dimension — all of the ‘symbol’
information written in sequential order necessary to
specify a particular biological molecular entity, be it
polypeptide or nucleotide.
The symbols are the one letter alphabetic codes for all
of the biological nitrogenous bases and amino acid
residues and their ambiguity codes. Biological
carbohydrates, lipids, and structural information are
not included within this sequence; however, much of
this type of information is available in the reference
documentation sections associated with primary
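To make the idea of one-letter symbols and ambiguity codes concrete, here is a small Python sketch using the standard IUPAC nucleotide ambiguity codes. The helper name `expand_ambiguous` and the example strings are hypothetical; the sketch simply lists every unambiguous sequence that an ambiguous one could stand for.

```python
from itertools import product

# IUPAC one-letter nucleotide codes and the bases each one can represent.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",  # any base
}


def expand_ambiguous(sequence):
    """Return every unambiguous DNA sequence matched by an ambiguous one."""
    choices = [IUPAC_NUCLEOTIDES[symbol] for symbol in sequence.upper()]
    return ["".join(bases) for bases in product(*choices)]


# 'GAY' covers both codons for aspartic acid.
expanded = expand_ambiguous("GAY")
print(expanded)  # ['GAC', 'GAT']
```

The number of expansions grows quickly with each ambiguous position, so this is only practical for short stretches such as a single codon or a primer site.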
What are sequence databases?
These databases are an organized way to store the
tremendous amount of sequence information that
accumulates from laboratories worldwide. Each
database has its own specific format. Three major
database organizations around the world are
responsible for maintaining most of this data; they
largely ‘mirror’ one another.
North America: National Center for Biotechnology
Information (NCBI): GenBank & GenPept.
Also Georgetown University’s NBRF Protein
Identification Resource: PIR & NRL_3D.
Europe: European Molecular Biology Laboratory (also
EBI & ExPASy): EMBL & Swiss-Prot.
Asia: The DNA Data Bank of Japan (DDBJ).
Content & organization:
Most sequence database installations are examples of complex
ASCII/Binary databases, but they usually are not Oracle or SQL or
Object Oriented (proprietary ones often are). They often contain
several very long text files containing different types of information
all related to particular sequences, such as all of the sequences
themselves, versus all of the title lines, or all of the reference
sections. Binary files often help ‘glue together’ all of these other
files by providing index functions.
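The role of an index can be sketched in a few lines of Python. This is an illustration only, assuming a simple FASTA-style text file and an in-memory dictionary standing in for the binary index files; it is not how GCG or any particular database installation is implemented. The names `build_index` and `fetch` and the two example records are hypothetical.

```python
import os
import tempfile


def build_index(path):
    """Map each record identifier to the byte offset of its title line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch(path, index, identifier):
    """Jump straight to one record and return (title line, sequence)."""
    with open(path, "rb") as handle:
        handle.seek(index[identifier])
        title = handle.readline()[1:].decode("ascii").strip()
        chunks = []
        for line in handle:
            if line.startswith(b">"):  # next record begins
                break
            chunks.append(line.decode("ascii").strip())
    return title, "".join(chunks)


# Hypothetical two-record flat file written to a temporary location.
records = ">seq1 hypothetical protein\nMKT\nAYI\n>seq2 another entry\nGATTACA\n"
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(records)

index = build_index(tmp.name)
result = fetch(tmp.name, index, "seq2")
os.remove(tmp.name)
print(result)  # ('seq2 another entry', 'GATTACA')
```

The point of the index is that a lookup seeks directly to the record instead of scanning a very long text file from the top.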
Software is usually required to successfully interact with these
databases and access is most easily handled through various
software packages and interfaces, either on the World Wide Web
or otherwise. Nucleic acid databases are split into subdivisions
based on taxonomy (historical). Protein databases are often
What are other biological databases?
Three dimensional structure databases:
the Protein Data Bank and Rutgers Nucleic Acid Database.
Still more; these can be considered ‘non-molecular’:
Reference Databases: e.g.
OMIM — Online Mendelian Inheritance in Man
PubMed/MedLine — over 11 million citations from more than
4 thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and
Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of
Genes and Genomes).
Population studies data — which strains, where, etc.
And then databases that most biocomputing folk don’t even usually
consider:
e.g. GIS/GPS/remote sensing data, medical records, census
So how do you do bioinformatics?
Often on the InterNet over the World Wide Web —
Site                          URL (Uniform Resource Locator)            Content
Nat’l Center Biotech' Info'   http://www.ncbi.nlm.nih.gov/              databases/analysis/software
PIR/NBRF                      http://www-nbrf.georgetown.edu/           protein sequence database
IUBIO Biology Archive         http://iubio.bio.indiana.edu/             database/software archive
Univ. of Montreal             http://megasun.bch.umontreal.ca/          database/software archive
Japan's GenomeNet             http://www.genome.ad.jp/                  databases/analysis/software
European Mol' Bio' Lab'       http://www.embl-heidelberg.de/            databases/analysis/software
European Bioinformatics       http://www.ebi.ac.uk/                     databases/analysis/software
The Sanger Institute          http://www.sanger.ac.uk/                  databases/analysis/software
Univ. of Geneva BioWeb        http://www.expasy.ch/                     databases/analysis/software
ProteinDataBank               http://www.rcsb.org/pdb/                  3D mol' structure database
Molecules R Us                http://molbio.info.nih.gov/cgi-bin/pdb/   3D protein/nuc' visualization
The Genome DataBase           http://www.gdb.org/                       The Human Genome
What other resources are
available?
Desktop software solutions — public domain programs
are available, but . . . complicated to install, configure,
and maintain. User must be pretty computer savvy.
So, commercial software packages are available, e.g.
MacVector, DS Gene, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per machine, and
Internet and/or CD database access all complicate
matters!
Therefore, UNIX server-based solutions, public domain or
commercial (e.g. the Accelrys GCG Wisconsin
Package [a Pharmacopeia Co.]): the SeqLab Graphical User
Interface.
University bioinformatics objectives:
The university tripartite mission —
Education, Research, and Service.
Education: out-reach programs and
undergraduate and graduate
courses.
Research: bioinformatics is becoming
an indispensable tool in most
biological research, particularly in
molecular and cellular biology.
Service: those faculty and staff who
know bioinformatics should be
available to assist with consultation,
Education:
Workshops — continue to teach GCG SeqLab tutorial
series; each of the four sessions offered once per
semester.
Modules — across the university curricula within existing
courses, interdisciplinary by nature, implications, &
ethics.
Graduate and Undergraduate Courses —
presently three cross-listed biology courses; one
introductory, team-taught survey, stressing practical,
project-oriented approaches; one advanced algorithms
lecture; one programming practicum.
Computational Molecular Biology Program —
proposed; to be in association and cooperating with
students’ present major department, coordinated by
CSIT. Pros and cons . . .
GCG SeqLab workshop series:
Four different sessions —
Intro’ to SeqLab & Multiple
Sequence Analysis and its
supplement,
Rational Primer Design,
Database Searching & Pairwise
Comparisons — Significance,
Molecular Evolutionary
Phylogenetics.
FOR MORE INFO...
http://bio.fsu.edu/~stevet/workshop.html
Modules in existing courses:
Cooperate with extant programs to
incorporate bioinformatics into their
existing curricula.
Key is to demonstrate necessity of
knowledge & offer full cooperation with
departments.
Potential courses exist across many
different departments, and even
across different colleges. Identify
potential courses from the General
Catalog and approach individual
Courses at Florida State:
Four different Special Topics Biology
Department BSC4933/5936
Bioinformatics sections —
First (first offered Spring 2002)—
“Introduction to Bioinformatics” Steve Thompson
et al.
Covers both sequence and structural analysis.
Team-taught; lecture + optional lab; pragmatic,
real-world, project-oriented approach.
Survey level — introduction to the theory +
practical applications. Pluses and minuses;
Courses (cont.)
Second (first offered Fall 2002) —
“Programming Skills for Computational
Biology and Bioinformatics” David
Swofford. The Java model, an object
oriented framework.
Third (first offered Spring 2003) —
“Advanced Bioinformatics:
Computational Methods” David
Swofford. The theory behind
sequence analysis algorithms.
New (Fall 2003) — “Genomics and
Courses (cont.)
Departments other than Biology:
Mathematics —
MAP 5485 “Introduction to Mathematical
Biophysics” Jack Quine. Mathematical tools
in Biophysics.
an integral part of their Biomedical
Mathematics Program.
Institute of Molecular Biophysics — Center Of
Excellence In Biomolecular Computer
Modeling & Simulation.
In all courses —
don’t ignore implications, ramifications, &
ethics of bioinformatics research.
Undergraduate opportunities:
A special undergraduate fellowship —
The Howard Hughes Undergraduate
Program in Mathematical and
Computational Biology.
Twelve Hughes Fellows per year earn a
$5000 stipend, a $1200 summer housing
allowance, and a $1000 professional
meeting allowance.
Supported by two new undergraduate
Computational Biology Program:
Presently FSU computational biology is
composed of a confusing mix of
undergraduate and graduate programs
across at least three different
Departments from the College of Arts and
Sciences.
We propose a CSIT Coordinated ‘balloon’
program in association with students’
major department that would consolidate
these efforts. Pros and Cons . . .
Undergraduate and/or Graduate?
Avoid duplication of effort.
Candidate department collaborations.
Summer short course:
Long-range ‘pipe dream’?
Broad spectrum — both instructors
and students from many different
disciplines and world-wide
distribution.
See MBL Mol’ Evol’ Workshop for a
model.
One or two weeks?
On-campus room & board support.
To be undertaken this course will
Bioinformatics degree programs
around the world:
Relatively rare, but more are being
created all the time. Biocomputing
education URLs are documented at:
http://www.csit.fsu.edu/HHP/gradprog.html
Most are graduate course lists, many
are graduate Masters or Ph.D.
programs, some include
undergraduate courses or programs.
There is a huge need for
Conclusions:
Gunnar von Heijne in his old but quite readable treatise,
Sequence Analysis in Molecular Biology: Treasure Trove
or Trivial Pursuit (1987), provides a very appropriate
conclusion:
“Think about what you’re doing; use your knowledge of the
molecular system involved to guide both your interpretation of
results and your direction of inquiry; use as much information as
possible; and do not blindly accept everything the computer offers
you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to
make a useful contribution one must first and foremost be a
biologist, and only second a theoretician . . . . We have to
FOR MORE INFO...
Visit my Web page:
http://bio.fsu.edu/~stevet/cv.html.
Contact me (stevet@bio.fsu.edu) for
specific bioinformatics assistance
and/or long distance collaboration.
Many fine texts are also starting to
become available in the field.
To ‘honk-my-own-horn’ a bit, check
out the new —
Current Protocols in Bioinformatics
from John Wiley & Sons, Inc:
http://www.does.org/cp/bioinfo.html.
They asked me to contribute a
chapter on multiple sequence
Humana Press, Inc. also
asked me to contribute.
I’ve got two chapters in
their —
Introduction to
Bioinformatics:
A Theoretical And
Practical Approach
http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0.
Both volumes are now
available.
Download