Access to Sequence Data &
Literature Information
LESSON 2
BIOINFORMATICS APPLICATIONS
1. Information Search in databases (DBs)
• Scientific literature: PubMed ……
• Gene: GenBank, EMBL-EBI, DDBJ ……
• Protein: UniProt, Protein Data Bank ……
2. Data Analysis using programs (PGs)
• Sequence similarity search: BLAST, FASTA ……
Bioinformatics Portals
1. NCBI (http://www.ncbi.nlm.nih.gov/)
2. EBI (http://www.ebi.ac.uk/)
3. ExPASy (http://www.expasy.org/)
Databases?
• Collections of data
• Sophisticated arrangements of storage
• Data in a structured manner
• Can be manipulated by software
Backbone of bioinformatics research
• Databases can be accessed locally or online and often link to each other
• The data come from different sources
GenBank
DATABASE OF MOST KNOWN NUCLEOTIDE AND
PROTEIN SEQUENCES
(http://www.ncbi.nlm.nih.gov/genbank/)
Growth of GenBank & WGS (figure)
• Dec 1982: 606 sequence records
• Dec 2013: 169,331,407 sequence records
Top organisms in GenBank
Nucl. Acids Res. (1 January 2014) 42 (D1): D32-D37.
insdc.org
- EBI
• Primary sequence databases hold many millions of nucleic acid sequence records.
Access to Information
Nucleotide DATABASES
Nucleotide (http://www.ncbi.nlm.nih.gov/nucleotide/)
Gene (http://www.ncbi.nlm.nih.gov/gene)
Protein-coding region
: Coding Segments (= CDS)
: Open Reading Frame (= ORF)
• Start codon: ATG → Met
• Stop codon: TAA, TAG, TGA
Untranslated Region (= UTR)
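The start and stop codons above are enough to locate a simple open reading frame. The following is a minimal sketch, not a full gene finder: it assumes a single forward strand with no introns, and the function name `find_first_orf` and the example sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_first_orf(dna):
    """Return (start, end, orf) for the first ATG...stop ORF, or None.

    Positions are 0-based and `end` is exclusive. Only the given strand
    is scanned; the reverse complement is ignored in this sketch.
    """
    seq = dna.upper()
    start = seq.find(START_CODON)
    while start != -1:
        # Read codon by codon in the reading frame set by this ATG.
        for pos in range(start, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                return start, end, seq[start:end]
        # No in-frame stop codon found: try the next ATG.
        start = seq.find(START_CODON, start + 1)
    return None


# Hypothetical sequence: lowercase flanks stand in for untranslated regions.
example = "ccATGGCTTGGTAAgg"
print(find_first_orf(example))
```

Everything before the returned start position and after the returned end position is outside the ORF, which corresponds to the untranslated flanks in this toy example.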
Accession number
• Label for a sequence
• String of letters and/or numbers that corresponds to a molecular sequence.
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
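Because an accession number identifies one record, it can be used directly as a lookup key. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service, which is not covered in these slides and is used here as an assumption. The helper name `build_efetch_url` is hypothetical, and `nuccore` is the E-utilities name for the Nucleotide database.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service (assumed, not from the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build a URL that requests one record by accession number as plain text."""
    params = {
        "db": db,            # which database to query
        "id": accession,     # the accession number acts as the record label
        "rettype": rettype,  # requested record format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("NM_000518")
print(url)
# The record could then be downloaded with urllib.request.urlopen(url).
```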
[Figure: sequence record with the protein coding segment marked by its start and end positions]
FASTA format
• Used to supply a nucleotide sequence as input to other programs (PGs).
(FASTA: a sequence-alignment and database-scanning program)
• Versatile and compact: one header line, followed by a string of nucleotides or amino acids written in capital letters using the single-letter codes.
>My Sequence Name
ARCGTCRGCKINTANDCKINTANDARCGCKINTANDRGCKINT
ANDNTANDARCGCKINTANDARNDBCQEDNBNCDNDNQENNDN
• Capital letters for the one-letter codes
• No space between codes
• Courier font for easy alignment
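The format rules above (one header line starting with ">", then sequence lines with no spaces) can be read with a few lines of code. This is a minimal sketch for illustration; the function name `parse_fasta` is hypothetical, and real pipelines usually rely on an established parser instead.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are joined without spaces, in capital letters.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>My Sequence Name
ARCGTCRGCKINTANDCKINTANDARCGCKINTANDRGCKINT
ANDNTANDARCGCKINTANDARNDBCQEDNBNCDNDNQENNDN
"""
records = parse_fasta(sample)
for name, sequence in records:
    print(name, len(sequence))
```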
DNA
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
• RefSeq identifiers include the following formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_000518
Protein               NP_######   e.g. NP_000509
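The prefix of a RefSeq identifier tells you what kind of molecule the record describes. The sketch below maps only the four prefixes listed above; the function name `classify_refseq` is hypothetical, and the optional ".N" version suffix accepted by the pattern is an addition not shown on the slide.

```python
import re

# Only the prefixes listed in the slide; other prefixes are not handled here.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}

# Two capital letters, an underscore, digits, and an optional version suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None if unknown."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ("NM_000518", "NP_000509"):
    print(acc, "->", classify_refseq(acc))
```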
Gene → mRNA → Protein
“Gene” at NCBI offers a wealth of information
• Genomic context
• Bibliography
• Phenotypes
• Gene Ontology (organizing principles of biological process, molecular function, cellular component)
• Reference sequences
• Additional (non-RefSeq) sequences
• Many, many links to NCBI resources
• Many, many links to external resources
Access to Information
Protein DATABASES
PIR (http://pir.georgetown.edu/)
Protein (http://www.ncbi.nlm.nih.gov/protein/)
EBI Proteins (http://www.ebi.ac.uk/services/proteins)
ExPASy (http://www.expasy.org/)
UniProt (http://www.uniprot.org/)
Universal Protein Resource
• TrEMBL: automatic translation of European Molecular Biology Laboratory (EMBL) nucleotide sequences
Ontology
• A logic-based organizational structure for knowledge.
• A set of field-specific descriptors that lets communities share the same concepts and definitions for specific terms.
• Makes scientific data sharing easy
• Integrates complex data
• Bridges the gap between different biological communities
Access to Information
GENOME BROWSERS
Ensembl (http://www.ensembl.org/)
UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
Map viewer (http://www.ncbi.nlm.nih.gov/mapview/)
GENOME?
• The total genetic content (information) contained in a full set of chromosomes.
Access to Information
Scientific Literature Search
PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
Bookshelf (http://www.ncbi.nlm.nih.gov/books/)