Genome Analysis: Databases, Sequence Formats and Visualization Tools Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization Objectives of today lecture Understand the purpose of, and use of, bioinformatics databases resources, such as GenBank,UniProt/Swiss-Prot, Entrez and Ensembl. Be able to recognize common database data formats and sequence features, sequence and genome browsers. What kind of tools are available to visualize sequence data? Appreciate the issues surrounding bioinformatic database updating. Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization Biological Databases and Data Models Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization Databases in general http://www.oxfordjournals.org/nar/database/c/ Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization Also check out the annual “web-software” issue of NAR every July Databases Organized array of information On the WWW or Local Place where you put things in, and (if all goes well!) you should be able to get them out again. Allows you to make discoveries. Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization Useful Database • Secondary (curated) • Primary (archival) – GenBank/EMBL/DDBJ (seqs) – PDB -(protein structures) – Medline (literature) – IMEx databases (protein interactions) – RefSeq (seqs) – UniProt - SwissProt (seqs) – Taxon (taxonomy) – PROSITE (binding sites) – OMIM (genetics literature/reviews) – IMEx databases (protein interactions) Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization Sequence Databases DNA NCBI: GenBank -> RefSeq National Center for Biotechnology Information www.ncbi.nlm.nih.gov EBI: EMBL European Bioinformatics Institute www.ebi.ac.uk Protein NCBI: GenPept TrEMBL= “translated EMBL” EBI: UniProt: TrEMBL -> UniProt: Swiss-Prot Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization NCBI: GenBank -> RefSeq Lecture 2. 
Further Readings!!!
http://www.ncbi.nlm.nih.gov/books/NBK21105/

EBI: EMBL

UniProt: Swiss-Prot, TrEMBL

UniProt: Swiss-Prot: An example of curated, reviewed annotation
Incorporates:
• Function of the protein
• Subcellular localization of the protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.

INSDC - International Nucleotide Sequence Database Collaboration
[Diagram: the three INSDC partners each accept submissions and updates and exchange data with one another: GenBank (NCBI, NIH; accessed via Entrez), EMBL (EBI; accessed via SRS) and DDBJ (CIB, NIG = National Institute of Genetics; accessed via getentry).]

File Formats

GenBank Flat File
Sections: Header (title, taxonomy, citation), Features (AA seq), DNA sequence

LOCUS       AF115338      591 bp    DNA   linear   BCT   19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene,
            complete cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the
            major outer membrane protein, OprF, in Pseudomonas aeruginosa
            and Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001,
            Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      ...

EMBL Flat File
Sections: Header (title, taxonomy, citation), Features (AA seq), DNA sequence

ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP   1-591
RX   MEDLINE; 99369842.
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16):4746-4754(1999).
RN   [2]
RP   1-591
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source          1..591
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS             1..591
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD34329.1"
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
     ...

UniProt: Swiss-Prot (a curated DB)
Sections: ID/AC/DT/DE/GN header, taxonomy (OS/OC), citation (RN to RL), comments (CC), disclaimer, database cross-references (DR), keywords (KW), features (FT), sequence (SQ)

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAR
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

PDB - Protein Data Bank

PDB – Provides?
• Protein and nucleic acid 3D structures
• X-ray, NMR, computationally predicted
• Sequence present

PDB
Record types highlighted: HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES, ATOM COORDINATES

HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC      1DGC   2
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC                   1DGC   3
COMPND   2 ATF/CREB SITE DNA                                            1DGC   4
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC        1DGC   5
AUTHOR    T.J.RICHMOND                                                  1DGC   6
REVDAT   1   22-JUN-94 1DGC    0                                        1DGC   7
JRNL        AUTH   P.KONIG,T.J.RICHMOND                                 1DGC   8
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO        1DGC   9
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA   1DGC  10
JRNL        TITL 3 FLEXIBILITY                                          1DGC  11
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993      1DGC  12
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070  1DGC  13
REMARK   1                                                              1DGC  14
REMARK   2                                                              1DGC  15
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.                                  1DGC  16
REMARK   3                                                              1DGC  17
REMARK   3 REFINEMENT.                                                  1DGC  18
REMARK   3   PROGRAM                    X-PLOR                          1DGC  19
REMARK   3   AUTHORS                    BRUNGER                         1DGC  20
REMARK   3   R VALUE                    0.216                           1DGC  21
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS                1DGC  22
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES                  1DGC  23
REMARK   3                                                              1DGC  24
REMARK   3   NUMBER OF REFLECTIONS      3296                            1DGC  25
REMARK   3   RESOLUTION RANGE    10.0 - 3.0    ANGSTROMS                1DGC  26
REMARK   3   DATA CUTOFF                3.0    SIGMA(F)                 1DGC  27
REMARK   3   PERCENT COMPLETION         98.2                            1DGC  28
REMARK   3                                                              1DGC  29
REMARK   3   NUMBER OF PROTEIN ATOMS                 456                1DGC  30
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS            386                1DGC  31
REMARK   4                                                              1DGC  32
...
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG  1DGC  60
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG  1DGC  61
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU  1DGC  62
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL  1DGC  63
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG              1DGC  64
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C  1DGC  65
SEQRES   2 B   19    A   T   C   T   C   C                              1DGC  66
HELIX    1   A ALA A  228  LYS A  276  1                                1DGC  67
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8  1DGC  68
ORIGX1      1.000000  0.000000  0.000000        0.00000                 1DGC  69
ORIGX2      0.000000  1.000000  0.000000        0.00000                 1DGC  70
ORIGX3      0.000000  0.000000  1.000000        0.00000                 1DGC  71
SCALE1      0.017047  0.000000  0.000000        0.00000                 1DGC  72
SCALE2      0.000000  0.017047  0.000000        0.00000                 1DGC  73
SCALE3      0.000000  0.000000  0.011539        0.00000                 1DGC  74

Data Formats
• Flat files √
• Many other formats for particular uses: XML, Clustal (for multiple sequence alignments), GFF (for sequence annotation), etc.
• FASTA – simplest!
• High-throughput data file formats: BAM, etc.

FASTA
Each record starts with a ">" header line, followed by the sequence.

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R

FASTA (multiple sequences in one file)

> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIV
DTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTI
TSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEW
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV

In GenBank, records are organized for various reasons. Understanding the rationale behind the "groupings" and "numbering" systems of such databases is the key to taking full advantage of database resources.

LOCUS vs Accession vs PID vs protein_id: What's the difference?
LOCUS: Unique string of up to 10 letters and numbers in the database. Not maintained amongst databases.
ACCESSION: A unique identifier for that record (particular sequence) in GenBank/EMBL/DDBJ that does not change when the record is updated.
Nucleotide gi: GenInfo identifier (gi), a unique integer specific to GenBank, which changes every time the sequence changes (and can disappear!).
VERSION: System started in 1999 for GenBank/EMBL/DDBJ in which the accession and version together play the same role as the accession and gi number. Format: accession.version
PID: Protein identifier: g, e or d prefix to the gi number. There can be one or two on one CDS (coding sequence).
Protein gi: GenInfo identifier (gi), a GenBank-specific unique integer, which changes every time the sequence changes.
protein_id: Identifier with the same structure and function as the nucleotide accession with version number.

LOCUS, Accession, NID, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
...
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

LOCUS:       HSU40282
ACCESSION:   U40282
VERSION:     U40282.1
GI:          3150001
PID:         g3150002
Protein gi:  3150002
protein_id:  AAC16892.1

Which of these would you use to cite a sequence in a paper? Can you think of situations where you would use one over another?

Briefly… Examples of Functional Divisions
PAT  Patent
EST  Expressed Sequence Tags
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High Throughput cDNA (unfinished)
GenBank overview: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1

Other Sequence (& related) File Formats
Historically, a number of other sequence and annotation file formats have been proposed; see:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
The demands of representing NGS data have given rise to additional file formats and data compression standards, some of which you will encounter in this course. The next few slides present an overview of a few of these emerging NGS formats and standards. See:
http://www.broadinstitute.org/software/igv/FileFormats
http://www.broadinstitute.org/software/igv/RecommendedFileFormats
http://genome.ucsc.edu/FAQ/FAQformat
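With this many flat-file formats in circulation, a script often has to work out which one it has been handed. The sketch below guesses the format from the first line of a record, using only the markers visible in the examples in this lecture (">" for FASTA, LOCUS for GenBank, HEADER for PDB, and an ID line ending in "BP." or "AA." for EMBL and Swiss-Prot). The function name guess_flat_file_format is hypothetical, and the rules are a rough heuristic for these slide examples, not a validator for the real format specifications.

```python
def guess_flat_file_format(first_line: str) -> str:
    """Guess a flat-file format from the first line of a record (heuristic only)."""
    line = first_line.strip()
    if line.startswith(">"):
        return "fasta"
    tokens = line.split()
    if not tokens:
        return "unknown"
    if tokens[0] == "LOCUS":
        return "genbank"
    if tokens[0] == "HEADER":
        return "pdb"
    if tokens[0] == "ID":
        # Assumption taken from the slide examples: the Swiss-Prot ID line
        # ends with a length in amino acids, the EMBL one in base pairs.
        if line.endswith("AA."):
            return "swissprot"
        if line.endswith("BP."):
            return "embl"
    return "unknown"


examples = [
    ">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4",
    "LOCUS       AF115338      591 bp    DNA   linear   BCT   19-AUG-1999",
    "ID   AF115338   standard; DNA; PRO; 591 BP.",
    "ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.",
    "HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC",
]
for example in examples:
    print(guess_flat_file_format(example), "<-", example[:40])
```

In practice a toolkit from the Open Bioinformatics Foundation projects would normally be asked to parse a named format rather than guess it; the point here is only that each format announces itself in its first few characters.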
Other Sequence (& Annotation) File Formats
• FASTQ – FASTA with quality data
• 2bit – compressed DNA sequence format
• SAM/BAM – Sequence Alignment/Map
• GFF/GTF – General Feature Format
• BED/WIG – annotation track data formats

FASTQ
FASTQ – FASTA "with an attitude" (embedded quality scores). Originally developed at the Sanger to couple (Phred) quality data with sequence, it is now common for raw read output from NGS machines to be specified in this format.

@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
*-+*''))**55CCF>>>>>>CCCC

Various flavors: fastq-sanger, fastq-illumina and fastq-solexa, differing in the format of the sequence identifier and in the valid range of quality scores. See:
http://en.wikipedia.org/wiki/FASTQ_format
http://maq.sourceforge.net/fastq.shtml
http://nar.oxfordjournals.org/content/early/2009/12/16/nar.gkp1137.full
"…the Sanger version of the FASTQ format has found the broadest acceptance, supported by many assembly and read mapping tools …Therefore, most users will do this conversion very early in their workflows…"

FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/ (Linux, MacOSX or Unix only)

2bit File Format
A highly compressed sequence file format that stores multiple DNA sequences (up to 4 Gb total) in a compact, randomly accessible form. The file contains masking information as well as the DNA itself.
http://genome.ucsc.edu/FAQ/FAQformat#format7
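To make the FASTQ quality line from the earlier slide concrete, here is a minimal sketch that decodes the example read. It assumes the fastq-sanger convention (each quality character is the Phred score plus an ASCII offset of 33); fastq-illumina uses an offset of 64, and fastq-solexa uses a different score definition that this sketch does not handle. The function names are illustrative and not taken from any particular toolkit.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, sequence, quality string)."""
    header, seq, plus, qual = [ln.rstrip("\n") for ln in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode quality characters; offset 33 is the Sanger convention (assumed here)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


record = [
    "@EAS54_6_R1_2_1_443_348",
    "GTTGCTTCTGGCGTGGGTGGGGGGG",
    "+EAS54_6_R1_2_1_443_348",
    "*-+*''))**55CCF>>>>>>CCCC",
]
name, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
print(name, "min Q =", min(scores), "max Q =", max(scores))
print("error probability of first base:", round(error_probability(scores[0]), 3))
```

Several characters in this quality string fall below ASCII 64, so it cannot be an offset-64 (fastq-illumina) encoding; checking the minimum character code is a common quick test when the flavor of a file is unknown.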
SAM/BAM
SAM – a tab-delimited text file that contains a compact and indexable representation of nucleotide sequence alignments
http://samtools.sourceforge.net/SAM1.pdf
http://samtools.sourceforge.net/
BAM – binary version of SAM (preferred by IGV)
SAM/BAM is the I/O format of several NGS tools; see: http://samtools.sourceforge.net/swlist.shtml
See also: Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.

Gene/General/Generic Feature Formats (GFF)
A General Feature Format (GFF) file is a relatively simple tab-delimited text file for describing genomic features.
Many genome browsers (GBrowse, IGV, etc.) take GFF as input for annotation data.
There are several slightly but significantly different GFF file formats (GFF, GFF2, GFF3, GTF). The current primary standard is GFF3:
http://www.sequenceontology.org/gff3.shtml

Excerpt of a GFF File

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123  .  gene  1000  9000  .  +  .  ID=gene00001;Name=EDEN
ctg123  .  mRNA  1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123  .  exon  1300  1500  .  +  .  ID=exon00001;Parent=mRNA00003
ctg123  .  CDS   1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1

BED File Format
BED format provides a flexible way to define the data lines that are displayed in an annotation track in a genome browser.
http://genome.ucsc.edu/FAQ/FAQformat#format1
If your data set is BED-like but very large, and you would like to keep it on your own server, use the bigBed data format.
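GFF and BED describe the same kind of information but count coordinates differently: GFF3 positions are 1-based and inclusive, while BED uses a 0-based start and an exclusive end. The sketch below parses two feature lines from the GFF excerpt and converts them to six-column BED rows. The function names are illustrative, the name column is filled from the Name or ID attribute as a convenience, and the BED score is set to 0 because the GFF score is "." in the excerpt; real GFF3 attributes may also be URL-escaped, which is not handled here.

```python
def parse_gff3_feature(line):
    """Parse one GFF3 feature line (9 tab-separated columns) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Column 9 is a semicolon-separated list of key=value pairs.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


def gff3_to_bed6(feature):
    """Convert a parsed GFF3 feature to a BED6 tuple (0-based, half-open)."""
    attrs = feature["attributes"]
    name = attrs.get("Name") or attrs.get("ID") or "."
    # 1-based inclusive start becomes 0-based; the end value is unchanged.
    return (feature["seqid"], feature["start"] - 1, feature["end"],
            name, 0, feature["strand"])


gff_text = (
    "##gff-version 3\n"
    "##sequence-region ctg123 1 1497228\n"
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n"
    "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\t"
    "ID=cds00001;Parent=mRNA00001;Name=edenprotein.1\n"
)
bed_rows = [
    gff3_to_bed6(parse_gff3_feature(ln))
    for ln in gff_text.splitlines()
    if ln and not ln.startswith("#")
]
for row in bed_rows:
    print("\t".join(str(field) for field in row))
```

Off-by-one errors between these two conventions are one of the most common mistakes when moving annotation between browsers, so it is worth checking a known feature by eye after any conversion.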
WIGgle format
The Wiggle format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data.
http://genome.ucsc.edu/goldenPath/help/wiggle.html
If you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead.
If you have a very large data set and you would like to keep it on your own server, use the bigWig data format.

EMBOSS Sequence Analysis Suite
emboss.sourceforge.net

Open Bioinformatics Foundation
bioperl / biojava / biopython / bioruby / biosql etc.
www.open-bio.org

Sequence Databases: "Roll your Own"?
• BioSQL: a lightweight database schema for storing and retrieving (annotated) sequence records using OpenBio software tools.
• GMOD "Chado": a more complex database schema for storing sequence data, genome feature annotation and a host of other related biological data (initially inspired by Drosophila genome annotation and genetics; supported by many GMOD software tools).

Retrieving Sequence Information: Using integrated database resources such as Entrez

What you may be looking for:
• You heard on CBC about a disease gene that was recently discovered, and you want to know more about it.
• You want to build a dataset of DNA sequences upstream of a set of co-expressed genes, to identify common regulatory element sequences.
• Evolutionary, functional, structural analyses, etc.
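The second use case above, collecting sequences upstream of a set of genes, can be sketched once the genome sequence and gene coordinates have been retrieved. The sketch assumes 1-based inclusive coordinates on the forward strand (as in GenBank and GFF) and returns each upstream region in the gene's own 5' to 3' orientation, so regions from minus-strand genes are reverse-complemented. The toy genome, gene names and the helper upstream_region are hypothetical and for illustration only.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string; other characters (e.g. N) pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length):
    """Return up to `length` bases upstream of a gene.

    start/end are 1-based inclusive forward-strand coordinates. The region is
    clipped at the ends of the sequence rather than wrapping around.
    """
    if strand == "+":
        left = max(0, start - 1 - length)
        return genome[left:start - 1]
    if strand == "-":
        right = min(len(genome), end + length)
        return reverse_complement(genome[end:right])
    raise ValueError("strand must be '+' or '-'")


# Toy data: a 16-base "genome" and two hypothetical genes at the same position
# on opposite strands.
genome = "AAACCCGGGTTTACGT"
genes = [("geneA", 7, 9, "+"), ("geneB", 7, 9, "-")]
for gene_name, start, end, strand in genes:
    region = upstream_region(genome, start, end, strand, 3)
    print(f">{gene_name}_upstream_3bp\n{region}")
```

The output is written as FASTA so that it can be passed directly to a motif-finding tool. For a circular bacterial chromosome, the clipping at the sequence ends would need to be replaced by wrap-around logic.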
Entrez: Initial version of this "Pathway to Discovery"
[Diagram: three linked databases. MEDLINE abstracts are linked to one another by term frequency statistics, nucleotide sequences by nucleotide sequence similarity, and protein sequences by amino acid sequence similarity. Literature citations in sequence databases link MEDLINE to both sequence sets, and coding region features link nucleotide sequences to protein sequences.]

PubMed Text Neighboring
Example neighbors: "Genetic Analysis of Cancer in Families" and "The Genetic Predisposition to Cancer"
• Common terms could indicate similar subject matter
• Statistical method
• Weights based on term frequencies within the document and within the database as a whole
• Some terms are better than others

Entrez began to integrate more data…
[Diagram: Entrez links among PubMed/MEDLINE, full-text online journals, GenBank nucleotide sequences, protein sequences, genomes (maps), structures (MMDB, VAST; structure:function), expression data and SNP data, connected through accession numbers.]

Entrez
Entrez Help: http://www.ncbi.nlm.nih.gov/books/NBK3837/
Check out also What's New: http://www.ncbi.nlm.nih.gov/books/NBK1969/
Or follow @NCBI on Twitter to keep up with new features (like the recently released Database of Genomic Structural Variation).
SFU's Cenk Sahinalp: international leader in structural variation bioinformatics research

BLink

Other Sequence Databases and Sequence Data Visualization
The Ensembl Genomes Database:
Focuses on humans and select vertebrates (but a plant version is also available…)
www.ensembl.org

What is Ensembl?
• Publicly available, automated annotation of selected eukaryotic genomes (initially with a mammalian focus)
• Open source software (but slightly complicated to set up…)
• Multiple ways to access data, including programmatic access (Perl API)
• Provides access to additional data from other groups (Distributed Annotation System, or DAS)

ENSEMBL – Region in Detail
Check out the "Printable mini-course" at http://uswest.ensembl.org/info/website/tutorials/index.html

Generic Model Organism Database (GMOD) Project
www.gmod.org

BioMart (EnsMart)
A powerful querying system (later: we'll learn about Ensembl's Perl API)

Distributed Annotation System (DAS)
• Allows third-party annotation
• Users choose the annotation they are interested in
• Good for specialized feature annotation or for comparison of different methodologies
• Allows you to view different data in a consistent user interface/display
  – Ensembl: open source display focused on eukaryotes
  – GBrowse: open source display for any dataset

GBrowse: Another genome data viewer with DAS
http://gmod.org/wiki/GBrowse
[Screenshot tracks: Gene, Protein, Metabolic pathways, Regulons, 3D structures, Intergenic sequences, Terminators, DNA sequence, Translation]
GBrowse is used to display genomic data for many projects
• Mouse, rat, fly, C. elegans and other animals
• Rice and a number of other plants
• S. cerevisiae and other yeasts
• A number of unicellular eukaryotes
• Many, many prokaryotes
• Other types of data: HapMap, segmental duplications, RNA-seq and other data-type-specific datasets
** Open source package ** (slightly simpler to set up than Ensembl)

Entrez, Ensembl, GBrowse: What's the difference?
• Entrez
  – Search and retrieval system for major databases, including PubMed, Sequences (including genomes), Structures, Taxonomy, etc.
  – NCBI (Maryland, USA) centrally hosts Entrez and decides what to host and maintain
  – Not open source
• Ensembl
  – Automated annotation of selected eukaryotic genomes
  – EMBL-EBI and the Sanger Institute (Cambridge/Hinxton, UK) centrally host most resources and decide what data to host and maintain
  – Open source; you can obtain a local copy and access other DAS data
• GBrowse
  – Genome/genomic data viewer
  – Very decentralized: anyone can set it up and publicly display any data
  – Open source; you can set up a local copy and access other DAS data

Entrez, Ensembl, GBrowse: Benefits/Disadvantages of each?
• Entrez
  – Reputable institution: trust in the data
  – Maintained by a well-established group with a lot of capital
  – Perceived greater consistency
  – Limited to what they make available
  – They make the call on how to display, analyze, and classify the data
  – Some of the analyses are definitely a black box
• Ensembl
  – Open source: you can see how the data is analyzed/processed
  – NOT necessarily an issue with lower-quality data: a lot of eyes are watching you (wooahh haa haa…)
  – Reputable institution: trust in the data
• GBrowse
  – Easy to use and set up
  – Open source: you can see how the data is analyzed/processed
  – Anybody can release their data to the world
  – Anybody can analyze the data in any way they want and release it to the world

Local Visualization of NGS Data
http://www.broadinstitute.org/igv/

How do I update or correct errors in the Databases?
Example: for gene names, citations, new protein names, or sequencing errors in GenBank: update@ncbi.nlm.nih.gov
But most people don't bother to correct things that they notice are wrong, hence the increased need for more focused community-based projects.

Community Assisted Curation of Subsets of Datasets
• Core curators continually update the annotation of a data subset (e.g. a genome)
  – Literature review
  – Input from the community
• Updates are sent in batches to centralized databases -> additional review -> becomes, for example, an NCBI RefSeq
• Examples: WormBase.org, Pseudomonas.com

Ethical issues with bioinformatics databases
• How public and/or open source should biomolecular data be?
• How much should researchers be forced to release data as soon as possible?
• How much analysis of a genome can a researcher publish before the genome sequence is published?
• How do we best organize the data? BIG issue!
  e.g. biomolecular pathway classifications can bias which pathways are found to be upregulated or downregulated in a gene expression analysis.

Resources
http://www.ncbi.nlm.nih.gov/
http://www.ebi.ac.uk/
http://www.expasy.ch/
http://www.ensembl.org/
http://www.rcsb.org/pdb/
http://www.pseudomonas.com/
http://www.wormbase.org/
http://biodas.org/
http://nar.oupjournals.org/
http://www.gmod.org/
http://www.broadinstitute.org/igv/