Organizing information in the post-genomic era The rise of bioinformatics An information explosion! Bioinformatics Computational tools are developed to collect, organize and analyze a wide variety of biological data Advances in DNA sequencing technologies have accelerated the pace of discovery. Much of the process is now automated. What is a database? Which databases are important for molecular cell biology research? How is information processed in databases? Literature Nucleotide Protein Organism Function Structure Biological databases use different organizing principles Hyperlinks connect records in different databases Databases are organized collections of information Information is stored in records Databases assign each record a unique accession number using their own numbering system Fields are used to cross-reference the data. Records can be searched by fields. Data is entered in the record using a defined format Accession # Field 1 ........................ Field 2 ....................... Data ............................................ ............................................ Bioinformaticians work with computer scientists to set up the database structure Curators review and link records within and between databases The information in databases ultimately derives from experimental data to PubMed Researchers do experiments Researchers analyze data and write papers Data is published in journals Curators will process the submissions and link entries in different databases What is a database? Which databases are important for molecular cell biology research? How is information processed in databases? Biologists use hundreds of different databases from around the world, some with similar foci Largest collection is housed at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine NLM-NCBI complex in Bethesda MD Large staff of curators process the information and compile information into derivative databases NCBI maintains both primary and derivative databases We’ll look at three of them PubMed is the premier literature database in the world SGD is a derivative database serving the yeast research community Grew out of decades of research Genome project provided a systematic organization for genes Questions for today: What is a database? Which databases are important for molecular cell biology research? How is information processed in databases? Curators are responsible for data flow between the NCBI databases GenBank Nucleotide sequences Automated translations of nucleotide sequences Annotated nucleic acid sequences are submitted to GenBank from many sources, including genome projects, individual investigators, and other databases – there is considerable REDUNDANCY in the information Sequences are compiled to generate non-redundant reference sequences RefSeq Non-redundant nucleotide and protein sequences Protein Amino acid sequences Experimentally determined amino acid sequences and information from other protein databases Most records in the Protein database have been derived by automated translation of nucleotide sequences On a larger scale: Genome projects have produced the reference sequences in nucleotide databases (robots and computers do much of the work) 1. Pieces of chromosomal DNA are sequenced, each ~1000 bp long S. cerevisiae genome is ~12 Mbp – how many reads would be necessary to cover each base pair in the genome once? 2. Overlapping sequence reads are aligned until sequences of entire chromosomes were complete Computer algorithms identify areas of sequence overlap Process is repeated to align long stretches of sequence Complete chromosome sequences are submitted to GenBank GenBank NC_####### (non-redundant chromosome) sequences 3. Chromosomal sequences are analyzed for the presence of potential transcripts (open reading frames; ORFs) ORFs are characterized by an under-representation of stop codons ORF-finding computer algorithms look for sequences that • begin with a methionine • methionine is separated from a stop codon in the same reading frame by a large number of amino acids (often 100, equiv. to 300bp) GenBank NM_####### records are predicted ORFs 4. Protein sequences are computationally predicted from ORF sequences GenBank NP_###### records Genes were given systematic (locus) names by their positions on chromosomes Systematic name for MET1: YKR069W Y (A-P) (L or R) (ORF number) left or right arm of chromosome yeast ORF number, counting away from the centromere (position = 0) chromosome 1=A 2=B etc. (W or C) sense strand is Watson or Crick strand (coding sequence is read 5’ to 3’) W C Left arm W centromere C Right arm Literature Nucleotide Protein Organism Function Structure