Organizing information in the post-genomic era
The rise of bioinformatics
An information explosion!
Bioinformatics
Computational tools are developed to collect, organize and analyze a wide variety of biological data
Advances in DNA sequencing technologies have accelerated the pace of discovery. Much of the process is now automated.
What is a database?
Which databases are important for molecular cell biology research?
How is information processed in databases?
Literature
Nucleotide
Protein
Organism
Function
Structure
Biological databases use different organizing principles
Hyperlinks connect records in different databases
Databases are organized collections of information
Information is stored in records
Databases assign each record a unique accession number using their own numbering system
Fields are used to cross-reference the data. Records can be searched by fields.
Data is entered in the record using a defined format
[Figure: a database record, showing the accession number, Field 1, Field 2, and the data]
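The record structure described above can be sketched in a few lines of Python. This is only a toy model, not how NCBI or any real database is implemented: the class names `Record` and `Database`, the accession numbers `TOY0001` and `TOY0002`, the field names, and the short sequences are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    accession: str                                # unique within one database
    fields: dict = field(default_factory=dict)    # searchable cross-reference fields
    data: str = ""                                # payload in the database's defined format


class Database:
    def __init__(self, name):
        self.name = name
        self._records = {}                        # accession -> Record

    def add(self, record):
        # Each database enforces uniqueness of its own accession numbers.
        if record.accession in self._records:
            raise ValueError(f"accession {record.accession} already in use")
        self._records[record.accession] = record

    def get(self, accession):
        return self._records[accession]

    def search(self, field_name, value):
        # Records can be searched by any field.
        return [r for r in self._records.values()
                if r.fields.get(field_name) == value]


# Hypothetical records; the sequences are placeholders, not real gene sequences.
db = Database("toy_nucleotide")
db.add(Record("TOY0001", {"organism": "Saccharomyces cerevisiae", "gene": "MET1"}, "ATGGCTTAA"))
db.add(Record("TOY0002", {"organism": "Saccharomyces cerevisiae", "gene": "MET2"}, "ATGAAATAG"))

print(db.get("TOY0001").fields["gene"])                    # MET1
print([r.accession for r in db.search("gene", "MET2")])    # ['TOY0002']
```

Looking a record up by accession number is a direct dictionary access, while a field search scans every record; real databases build indexes on fields to avoid that scan.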
Bioinformaticians work with computer scientists to set up the database structure
Curators review and link records within and between databases
The information in databases ultimately derives from experimental data
Researchers do experiments
Researchers analyze data and write papers
Data is published in journals
[Figure: arrow leading to PubMed]
Curators process the submissions and link entries in different databases
Biologists use hundreds of different databases from around the world, some with similar foci
The largest collection is housed at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM)
[Figure: NLM-NCBI complex in Bethesda, MD]
A large staff of curators processes the information and compiles it into derivative databases
NCBI maintains both primary and derivative databases
We’ll look at three of them
PubMed is the premier literature database in the world
SGD (Saccharomyces Genome Database) is a derivative database serving the yeast research community
Grew out of decades of research
Genome project provided a systematic organization for genes
Curators are responsible for data flow between the NCBI databases
GenBank: nucleotide sequences
Annotated nucleic acid sequences are submitted to GenBank from many sources, including genome projects, individual investigators, and other databases, so there is considerable REDUNDANCY in the information
RefSeq: non-redundant nucleotide and protein sequences
Sequences are compiled to generate non-redundant reference sequences
Protein: amino acid sequences
Experimentally determined amino acid sequences and information from other protein databases
Most records in the Protein database have been derived by automated translation of nucleotide sequences
On a larger scale: Genome projects have produced the reference sequences in nucleotide databases (robots and computers do much of the work)
1. Pieces of chromosomal DNA are sequenced, each ~1000 bp long
S. cerevisiae genome is ~12 Mbp – how many reads would be necessary to cover each base pair in the genome once?
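The arithmetic behind the question can be written out as a short calculation. It is an idealized lower bound that assumes reads tile the genome end to end with no gaps and no overlap; the function name `min_reads` and the `coverage` parameter are introduced here for illustration only.

```python
import math

genome_bp = 12_000_000   # S. cerevisiae genome, ~12 Mbp
read_bp = 1_000          # each sequence read, ~1000 bp


def min_reads(genome_bp, read_bp, coverage=1):
    """Fewest reads needed if reads tiled the genome perfectly `coverage` times."""
    return math.ceil(coverage * genome_bp / read_bp)


print(min_reads(genome_bp, read_bp))   # 12000
```

In practice many more reads are needed, because reads start at random positions and must overlap one another before they can be aligned.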
2. Overlapping sequence reads are aligned until the sequences of entire chromosomes are complete
Computer algorithms identify areas of sequence overlap
The process is repeated to align long stretches of sequence
Complete chromosome sequences are submitted to GenBank
RefSeq NC_####### (non-redundant chromosome) sequences
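The idea of finding overlaps and merging repeatedly can be illustrated with a toy greedy procedure. This is a simplified sketch, not the algorithm used by any genome project: it assumes error-free reads that all come from the same strand, and the function names, the `min_overlap` parameter, and the three short reads are invented for the example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break   # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTAC", "GTACGGATT", "GATTCCAGA"]   # made-up reads
print(greedy_assemble(reads))                     # ['ATGGCGTACGGATTCCAGA']
```

Each pass compares every pair of sequences, which is far too slow for millions of real reads; it is shown only to make the repeat-until-complete logic concrete.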
3. Chromosomal sequences are analyzed for the presence of potential transcripts (open reading frames; ORFs)
ORFs are characterized by an under-representation of stop codons
ORF-finding computer algorithms look for sequences that
• begin with a methionine codon
• have a large number of amino acids (often 100, equivalent to 300 bp) between that methionine and the next stop codon in the same reading frame
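The two criteria above can be turned into a minimal ORF scan. This sketch is an assumption-laden illustration rather than the algorithm used for the yeast genome: it searches only the three reading frames of the strand it is given (a full search would also scan the reverse complement), it takes the first ATG after each stop codon as the start, and the name `find_orfs` and the test sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    `start` is the index of the ATG, `end` is one past the stop codon,
    and an ORF is reported only if it has at least `min_codons` codons
    before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                       # first methionine codon in this stretch
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                      # resume looking for the next ATG
    return orfs


toy = "CCATGAAACCCGGGTAGCC"            # made-up sequence with a 4-codon ORF
print(find_orfs(toy, min_codons=3))    # [(2, 17, 2)]
print(find_orfs(toy))                  # [] because the ORF is shorter than 100 codons
```

The length threshold is what filters out the short ATG-to-stop stretches that occur by chance in random sequence.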
RefSeq NM_####### records are the predicted transcripts (ORFs)
4. Protein sequences are computationally predicted from ORF sequences
RefSeq NP_###### records
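Predicting a protein from an ORF amounts to reading the sequence three bases at a time and looking each codon up in the genetic code. The sketch below uses the standard nuclear genetic code; the names `CODON_TABLE` and `translate` and the example ORF are chosen here for illustration, and the function assumes the input contains only A, C, G and T and starts in frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    orf = orf.upper()
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[pos:pos + 3]]   # KeyError on ambiguous bases such as N
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGAAACCCGGGTAG"))   # MKPG
```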
Genes were given systematic (locus) names by their positions on chromosomes
Systematic name for MET1: YKR069W
Y = yeast
(A-P) = chromosome (1=A, 2=B, etc.)
(L or R) = left or right arm of chromosome
(ORF number) = ORF number, counting away from the centromere (position = 0)
(W or C) = sense strand is Watson or Crick strand (coding sequence is read 5’ to 3’)
[Figure: chromosome showing the left arm, centromere, and right arm, with the W and C strands labeled]
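The naming scheme can be decoded mechanically. The following sketch assumes the basic seven-character form shown for YKR069W, with a three-digit ORF number; names carrying an extra suffix are not handled, and the function name `parse_systematic_name` is invented for this example.

```python
import re

# Y + chromosome letter (A-P) + arm (L/R) + 3-digit ORF number + strand (W/C)
NAME_PATTERN = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])$")


def parse_systematic_name(name):
    """Split a yeast systematic name into its positional components."""
    match = NAME_PATTERN.match(name.upper())
    if match is None:
        raise ValueError(f"not a basic systematic name: {name!r}")
    chrom_letter, arm, orf_number, strand = match.groups()
    return {
        "chromosome": ord(chrom_letter) - ord("A") + 1,   # A=1, B=2, ... P=16
        "arm": "left" if arm == "L" else "right",
        "orf_number": int(orf_number),                    # counted away from the centromere
        "strand": "Watson" if strand == "W" else "Crick",
    }


print(parse_systematic_name("YKR069W"))
# {'chromosome': 11, 'arm': 'right', 'orf_number': 69, 'strand': 'Watson'}
```

So MET1 (YKR069W) is the 69th ORF from the centromere on the right arm of chromosome 11, encoded on the Watson strand.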