Access to Sequence Data & Literature Information
LESSON 2

BIOINFORMATICS APPLICATIONS
1. Information search in databases (DBs)
   Scientific literature: PubMed …
   Gene: GenBank, EMBL-EBI, DDBJ …
   Protein: UniProt, Protein Data Bank …
2. Data analysis using programs (PGs)
   Sequence similarity search: BLAST, FASTA …

Bioinformatics Portals
1. NCBI (http://www.ncbi.nlm.nih.gov/)
2. EBI (http://www.ebi.ac.uk/)
3. ExPASy (http://www.expasy.org/)

Databases?
• Collections of data
• Sophisticated arrangements of storage
• Data in a structured manner
• Can be manipulated by software
• Backbone of bioinformatics research
• Databases can be accessed locally or online and often link to each other
• The data come from different sources

GenBank
DATABASE OF MOST KNOWN NUCLEOTIDE AND PROTEIN SEQUENCES (http://www.ncbi.nlm.nih.gov/genbank/)

[Figure: Growth of GenBank & WGS. Labeled points: Dec 1982, 606; Dec 2013, 169331407]
[Figure: Top organisms in GenBank]
Nucl. Acids Res. (1 January 2014) 42 (D1): D32-D37.
insdc.org - EBI
• Primary sequence databases hold many millions of nucleic acid sequence records.

Access to Information
Nucleotide DATABASES
Nucleotide (http://www.ncbi.nlm.nih.gov/nucleotide/)
Gene (http://www.ncbi.nlm.nih.gov/gene)

Protein-coding region: Coding Segment (= CDS), Open Reading Frame (= ORF)
Start codon: ATG → Met
Stop codon: TAA, TAG, TGA
Untranslated Region (= UTR)

Accession number: label for a sequence
A string of letters and/or numbers that corresponds to a molecular sequence. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
[Figure labels: protein coding segment, start, end]

To use the nucleotide sequence as input for other programs (PGs).
FASTA format (FASTA: a sequence-alignment and database-scanning program)
• Versatile and compact: one header line followed by a string of nucleotides or amino acids
• Capital letters for the one-letter codes
• No space between codes
• Courier font for easy alignment

>My Sequence Name
ARCGTCRGCKINTANDCKINTANDARCGCKINTANDRGCKINT
ANDNTANDARCGCKINTANDARNDBCQEDNBNCDNDNQENNDN

RefSeq
• Provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
• RefSeq identifiers include the following formats:

  Complete genome       NC_######
  Complete chromosome   NC_######
  Genomic contig        NT_######
  mRNA (DNA format)     NM_######   e.g. NM_000518
  Protein               NP_######   e.g. NP_000509

Gene, mRNA, Protein

“Gene” at NCBI offers a wealth of information
• Genomic context
• Bibliography
• Phenotypes
• Gene Ontology (organizing principles of biological process, molecular function, cellular component)
• Reference sequences
• Additional (non-RefSeq) sequences
• Many, many links to NCBI resources
• Many, many links to external resources

Access to Information
Protein DATABASES
PIR (http://pir.georgetown.edu/)
Protein (http://www.ncbi.nlm.nih.gov/protein/)
EBI Proteins (http://www.ebi.ac.uk/services/proteins)
ExPASy (http://www.expasy.org/)
UniProt (http://www.uniprot.org/): Universal Protein Resource
• TrEMBL: automatic translation of European Molecular Biology Laboratory (EMBL) nucleotide sequences

Ontology
• A logic-based organizational structure for knowledge.
• A set of field-specific descriptors that lets different groups share the same concepts and definitions for specific terms.
• Scientific data sharing made easy
• Integration of complex data
• Bridges the gap between different biological communities

Access to Information
GENOME BROWSERS
Ensembl (http://www.ensembl.org/)
UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/)

GENOME?
The total genetic content (information) contained in a full set of chromosomes.

Access to Information
Scientific Literature Search
PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
Bookshelf (http://www.ncbi.nlm.nih.gov/books/)
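
Worked example (not part of the original slides)
The FASTA layout, the start and stop codons, and the RefSeq prefixes covered earlier in this lesson can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names (parse_fasta, first_orf, refseq_type) are hypothetical and do not belong to any NCBI or EBI tool, the ORF search reads the forward strand only and returns the first ATG-to-stop stretch it finds, and the demo sequence is made up.

```python
from typing import Dict, Optional

# Stop codons as listed in the lesson (start codon is ATG).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# RefSeq prefixes as listed in the lesson; anything else is reported as unknown.
REFSEQ_PREFIXES = {
    "NC_": "complete genome or chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "NP_": "protein",
}


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each header (text after '>') to its sequence, joined and upper-cased."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def first_orf(dna: str) -> Optional[str]:
    """Return the first forward-strand stretch from ATG to an in-frame stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon in the reading frame set by this ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        start = dna.find("ATG", start + 1)
    return None


def refseq_type(accession: str) -> str:
    """Classify an accession by its three-character RefSeq prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# Made-up demo record: two sequence lines under one header line.
demo = ">demo\nccATGAAATGG\nTAACC\n"
records = parse_fasta(demo)
orf = first_orf(records["demo"])
```

Here records["demo"] is the joined sequence CCATGAAATGGTAACC and orf is ATGAAATGGTAA. The TGA that overlaps the start codon is skipped because it is not in the reading frame set by ATG. For the accessions quoted in the lesson, refseq_type("NM_000518") gives "mRNA" and refseq_type("NP_000509") gives "protein".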