Sequence Analysis (I)
Yuh-Shan Jou (周玉山)
jou@ibms.sinica.edu.tw
Institute of Biomedical Sciences, Academia Sinica

Bioinformatics
• Bioinformatics is the application of information technology to analyze, process, and manage biological data.
• Bioinformatics provides computational tools to facilitate the process Data → Information → Knowledge → Discovery.
Don't believe everything you see in a database, even in GenBank! Quality control (QC) is the most important concern in bioinformatics.

Roadmap to Genomics
[Figure: roadmap of the Human Genome project. Recoverable labels:]
• Markers: EST (Expressed Sequence Tag), STS (Sequence Tagged Site), STR (Short Tandem Repeat)
• Genomic DNA contigs: cosmid contigs, YAC contigs, BAC or PAC contigs
• Sequencing technologies; sequencing of the human genome
• Physical maps; radiation hybrid mapping panels; positional cloning; positional candidate approaches
• Database ESTs (dbEST); cDNA sequencing; full-length cDNAs; transcriptional map of the human genome
• Expression patterns; expression profiles; microarrays of genes
• Diseases; markers for diagnosis; diagnosis with GeneChips
• Functional genomics

A Vision for the Future of Genome Research
Francis S. Collins (National Human Genome Research Institute, NIH, USA), Nature 422:835 (2003)

International Sequence Database Collaboration
[Figure: three databases exchanging data, each receiving submissions and updates:]
• GenBank (NCBI, NIH), searched with Entrez
• EMBL (EBI), searched with SRS
• DDBJ (CIB, NIG), searched with getentry

Genome browsers: www.ensembl.org and http://genome.ucsc.edu

Integration: Bioinformatics Databases and Scientific Algorithms
  Resource           Data format
  Medline            ASN.1
  Microarray data    RDBMS, Excel
  BLAST              FASTA
  OMIM               Text file
  KEGG               HTML text, binary images
  Entrez/NCBI        ASN.1
  ClustalW           FASTA
  PDB                Oracle, 3D images

Web Access: www.ncbi.nlm.nih.gov
[Figure: NCBI web traffic in users per day (0 to 600,000), 1998 to 2005, plotted against world and US Internet users; the dips are marked "Christmas and New Year's Day".]

The Entrez System: Text Searches

Types of Databases
• Primary Databases
  – Original submissions by experimentalists
  – Content controlled by the submitter
  – Examples: GenBank, SNP, GEO
• Derivative Databases
  – Built from primary data
  – Content controlled by a third party (NCBI)
  – Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

Entrez Nucleotides
  Primary      GenBank / EMBL / DDBJ     49,675,750
  Derivative   RefSeq                       545,503
               Third Party Annotation         4,544
               PDB                            5,561
  Total                                  50,231,358

Entrez Protein: Derivative Databases
  GenPept                   3,950,968
  RefSeq                    1,348,072
  Third Party Annotation        4,133
  Swiss-Prot                  170,087
  PIR                         282,821
  PRF                          12,079
  PDB                          61,845
  Total                     5,830,005
  BLAST nr total            2,336,522

The Growth of GenBank
[Figure: GenBank growth, Jun-82 to Jun-04; axes: base pairs (billions) and records (millions).]
Release 148: 45.2 million records, 49.4 billion nucleotides
Average doubling time ≈ 14 months*

Organization of GenBank: Traditional Divisions
Records are divided into 17 divisions: 11 traditional and 6 bulk.
Traditional divisions:
• Direct submissions (Sequin and BankIt)
• Accurate
• Well characterized
  PRI (28)  Primate
  PLN (13)  Plant and Fungal
  BCT (11)  Bacterial and Archaeal
  INV (7)   Invertebrate
  ROD (15)  Rodent
  VRL (4)   Viral
  VRT (7)   Other Vertebrate
  MAM (1)   Mammalian
  PHG (1)   Phage
  SYN (1)   Synthetic (cloning vectors)
  UNA (1)   Unannotated
Entrez query: gbdiv_xxx[Properties]

Organization of GenBank: Bulk Divisions
Bulk divisions:
• Batch submission (email and FTP)
• Inaccurate
• Poorly characterized
  EST (355)  Expressed Sequence Tag
  GSS (132)  Genome Survey Sequence
  HTG (62)   High Throughput Genomic
  STS (5)    Sequence Tagged Site
  HTC (6)    High Throughput cDNA
  PAT (17)   Patent
Entrez query: gbdiv_xxx[Properties]

File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
• GenBank/GenPept (useful for scientists)
• FASTA (the simplest format)
• ASN.1 & XML (useful for programmers)

A Traditional GenBank Record

  LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
  DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
              complete cds.
  ACCESSION   AY182241
  VERSION     AY182241.2  GI:32265057
  KEYWORDS    .
  SOURCE      Malus x domestica (cultivated apple)
    ORGANISM  Malus x domestica
              Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
              Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
              rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
  REFERENCE   1  (bases 1 to 1931)
    AUTHORS   Pechous,S.W. and Whitaker,B.D.
    TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
              synthase cDNA from peel tissue of apple fruit
    JOURNAL   Planta 219, 84-94 (2004)
  REFERENCE   2  (bases 1 to 1931)
    AUTHORS   Pechous,S.W. and Whitaker,B.D.
    TITLE     Direct Submission
    JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
              USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
              20705, USA
  REFERENCE   3  (bases 1 to 1931)
    AUTHORS   Pechous,S.W. and Whitaker,B.D.
    TITLE     Direct Submission
    JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
              USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
              20705, USA
    REMARK    Sequence update by submitter
  COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
  FEATURES             Location/Qualifiers
       source          1..1931
                       /organism="Malus x domestica"
                       /mol_type="mRNA"
                       /cultivar="'Law Rome'"
                       /db_xref="taxon:3750"
                       /tissue_type="peel"
       gene            1..1931
                       /gene="AFS1"
       CDS             54..1784
                       /gene="AFS1"
                       /note="terpene synthase"
                       /codon_start=1
                       /product="(E,E)-alpha-farnesene synthase"
                       /protein_id="AAO22848.2"
                       /db_xref="GI:32265058"
                       /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                       NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                       EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                       DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                       GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                       LSLLFQPLVN"
  ORIGIN
          1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
         61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
        121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
        181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
        241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
  //

The Flatfile Format
A flat file has three parts: the Header, the Feature Table, and the Sequence.

The Header
The header runs from the LOCUS line through the COMMENT line of the record above.

Header: Locus Line
  LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
• Locus name: AY182241
• Length: 1931 bp
• Molecule type: mRNA
• Division: PLN
• Modification date: 04-MAY-2004

Header: Database Identifiers
  ACCESSION   AY182241
  VERSION     AY182241.2  GI:32265057
• Accession: stable, reportable, universal
• Version: tracks changes in sequence
• GI number: NCBI internal use

Header: Organism
  SOURCE      Malus x domestica (cultivated apple)
    ORGANISM  Malus x domestica
              Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
              Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
              rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
• NCBI-controlled taxonomy
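The locus line and database identifiers described above can be pulled apart with a few lines of Python. This is a minimal sketch, assuming the fields are separated by whitespace and all present, as they are in the AY182241 record; the function names `parse_locus_line` and `parse_version_line` are illustrative and not part of any NCBI tool.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line into named fields.

    Assumption: fields are whitespace-separated and all present,
    as in the AY182241 example record.
    """
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError("unexpected LOCUS line layout: %r" % line)
    return {
        "locus_name": fields[1],
        "length": int(fields[2]),       # fields[3] is the unit, e.g. "bp"
        "molecule_type": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "modification_date": fields[7],
    }


def parse_version_line(line):
    """Split a VERSION line into accession, version number and GI number."""
    fields = line.split()
    if len(fields) != 3 or fields[0] != "VERSION":
        raise ValueError("unexpected VERSION line layout: %r" % line)
    accession, version = fields[1].rsplit(".", 1)
    gi = int(fields[2].split(":", 1)[1])
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus_line(
    "LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004"
)
version = parse_version_line("VERSION     AY182241.2  GI:32265057")
print(locus)
print(version)
```

For real work, an established parser is usually a better choice than hand-written splitting, because older or unusual records do not always carry every field.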
The Feature Table

  FEATURES             Location/Qualifiers
       source          1..1931
                       /organism="Malus x domestica"
                       /mol_type="mRNA"
                       /cultivar="'Law Rome'"
                       /db_xref="taxon:3750"
                       /tissue_type="peel"
       gene            1..1931
                       /gene="AFS1"
       CDS             54..1784
                       /gene="AFS1"
                       /note="terpene synthase"
                       /codon_start=1
                       /product="(E,E)-alpha-farnesene synthase"
                       /protein_id="AAO22848.2"
                       /db_xref="GI:32265058"
                       /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                       NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                       EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                       NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                       LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                       ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                       EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                       KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                       DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                       GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI

• CDS 54..1784 is the coding sequence: start (atg) at position 54, stop (tag) ending at position 1784
• /translation is the implied protein
• /protein_id and /db_xref are the GenPept identifiers

The Sequence: 99.99% Accurate

  ORIGIN
          1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
         61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
        121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
        181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
        ...
       1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
       1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
       1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
       1921 aaaaaaaaaa a
  //

FASTA Format

  >gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
  MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
  VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
  DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
  LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
  RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
  KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
  NCWHLDAAMRPSFLQLREQLEHIKTHELHL

Definition line fields: >gi number | database identifier | Accession.Version | Locus Name, followed by the description and the organism.

Database identifiers:
  gb   GenBank
  emb  EMBL
  dbj  DDBJ
  ref  RefSeq
  sp   SWISS-PROT
  pdb  Protein Databank
  pir  PIR
  prf  PRF
  tpg  TPA-GenBank
  tpe  TPA-EMBL
  tpj  TPA-DDBJ

Abstract Syntax Notation: ASN.1

  Seq-entry ::= set {
    class nuc-prot ,
    descr {
      title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
   complete cds." ,
      source {
        org {
          taxname "Malus x domestica" ,
          common "cultivated apple" ,
          db {
            {
              db "taxon" ,
              tag
                id 3750 } } ,
          orgname {
            name
              binomial {
                genus "Malus" ,
                species "x domestica" } ,
            mod {
              {
                subtype cultivar ,

[Figure labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein.]

Bulk Divisions
• Batch submission and HTG (email and FTP)
• Inaccurate
• Poorly characterized
• Expressed Sequence Tag
  – 1st pass single read cDNA
• Genome Survey Sequence
  – 1st pass single read gDNA
• High Throughput Genomic
  – incomplete sequences of genomic clones
• Sequence Tagged Site
  – PCR-based mapping reagents

EST Division: Expressed Sequence Tags
Entrez query: gbdiv_est[Properties]
[Figure: the 30,000 genes in the nucleus give RNA gene products; make a cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5' and 3').]

  >IMAGE:275615 5' mRNA sequence
  GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
  TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
  TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
  GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
  TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
  AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
  TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

  >IMAGE:275615 3', mRNA sequence
  NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
  TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
  AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
  CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
  GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

ESTs in Entrez
  Total       26 million records
  Human       6.0 million
  Mouse       4.3 million
  Rat         0.7 million
  Zebrafish   0.6 million
  Wheat       0.6 million
  Barley      0.3 million
  Maize       0.4 million

Genome Sequencing - HTG, GSS, (WGS)
[Figure: a whole BAC insert (or genome) goes through shredding, cloning, isolating and sequencing (GSS division or trace archive), then assembly. BAC assemblies give Draft Sequence (HTG division); whole genome shotgun assemblies go to a traditional division.]

HTG Division: Rice Draft Sequences
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences move to a traditional GenBank division

Whole Genome Shotgun Projects
• Traditional GenBank divisions
• 200+ projects
  – Virus
  – Bacteria
  – Environmental sequences
  – Archaea
  – 51 Eukaryotes, featuring:
    • Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
    • Pufferfish (2)
    • Honeybee, Anopheles, Fruit Flies (3), Silkworm
    • Nematode (C. briggsae)
    • Yeasts (8), Aspergillus (2)
    • Rice

Zebrafish: WGS
Entrez query: wgs_master[Properties]

Derivative Databases
• UniGene
• RefSeq
• TPA

Primary vs. Derivative Sequence Databases
[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. Curators, genome assembly and algorithms build RefSeq and UniGene from GenBank; these are updated continually by NCBI.]

What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Now informed by genome hits (new)
• Nonredundant set of gene-oriented clusters
• Each cluster is a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents

EST hits: Human mRNA
[Figure: albumin mRNA with its 5' EST hits and 3' EST hits.]

UniGene: Expressed Sequences
Expression Data

RefSeq Release 11 (May 13, 2005) is available on the FTP site!
• Forming the "best representative" sequence
• Standardizing nomenclature and record structure
• Adding annotation (references, sequence features)
• Stable reference for, e.g., gene identification, polymorphism discovery, comparative analysis
• RefSeq Release 11 includes over 1,425,971 proteins and 2928 organisms.
• The release is available by FTP at: ftp://ftp.ncbi.nih.gov/refseq/release/
• RefSeq number is still not fixed.
Entrez query: srcdb_refseq[Properties]

Curated RefSeq Records

RefSeq Nucleotide:
  LOCUS       ADSS     1368 bp    mRNA    linear   PRI 27-AUG-2002
  DEFINITION  Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
  ACCESSION   NM_001126
  VERSION     NM_001126.1  GI:4557270

RefSeq Protein:
  LOCUS       ADSS     455 aa    linear   PRI 27-AUG-2002
  DEFINITION  adenylosuccinate synthase; Adenylosuccinate synthetase
              (Ade(-)H-complementing) Homo sapiens.
  ACCESSION   NP_001117
  VERSION     NP_001117.1  GI:4557271
  DBSOURCE    REFSEQ: accession NM_001126.1
  COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
              reference sequence was derived from X66503.1.
              Summary: Adenylosuccinate synthetase catalyzes the first committed
              step in the conversion of IMP to AMP.
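The two curated records above carry Accession.Version identifiers (NM_001126.1 and NP_001117.1) whose prefix tells you what kind of RefSeq record you are looking at. The sketch below splits such an identifier and looks up the prefix; the prefix meanings follow the RefSeq accession series used in this lecture, and the function name `describe_refseq` is an illustrative assumption, not an NCBI API.

```python
# Prefix meanings as given for the RefSeq accession series in this lecture.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NC": "chromosome (also microbial replicons, organelle)",
    "NT": "contig",
    "NW": "WGS supercontig",
}


def describe_refseq(identifier):
    """Split 'NM_001126.1' into accession, version and record type."""
    accession, _, version = identifier.partition(".")
    prefix, underscore, _number = accession.partition("_")
    if not underscore or prefix not in REFSEQ_PREFIXES:
        raise ValueError("not a recognized RefSeq accession: %r" % identifier)
    return {
        "accession": accession,
        "version": int(version) if version else None,
        "record_type": REFSEQ_PREFIXES[prefix],
    }


mrna = describe_refseq("NM_001126.1")
protein = describe_refseq("NP_001117.1")
print(mrna)
print(protein)
```

A GenBank accession such as AY182241 has no underscore and is rejected by this check, which is one quick way to tell primary records from RefSeq records.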
X records: genome annotation, inferred or predicted
N records: provisional, reviewed or validated

RefSeq Accession Numbers
  mRNAs and Proteins
    NM_123456   Curated mRNA
    NP_123456   Curated Protein
    NR_123456   Curated non-coding RNA
    XM_123456   Predicted mRNA
    XP_123456   Predicted Protein
    XR_123456   Predicted non-coding RNA
  Gene Records
    NG_123456   Reference Genomic Sequence
  Chromosome
    NC_123455   Microbial replicons, organelle
  Assemblies
    NT_123456   Contig
    NW_123456   WGS Supercontig

RefSeq Curation Processes
[Figure: curated genomic DNA (NC, NT, NW) is scanned to give model mRNA (XM), model protein (XP) and (XR) records; curation gives curated mRNA (NM), protein (NP) and (NR) records.]
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions

RefSeq: NCBI's Derivative Sequence Database
• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled genomic regions (contigs)
  – human genome
  – mouse genome
  – rat genome
• Chromosome records
  – human genome
  – microbial
  – organelle
Entrez query: srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/

RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators

Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
  – BankIt
  – Sequin
Entrez query: tpa[Properties]
[Figure: a TPA record built on a WGS assembly, with its CDS feature and TPA protein.]

Human Nucleotide Sequences
  ISDC (GenBank/EMBL/DDBJ)    8,965,327
    PRI                         916,017  (WGS 601,855)
    EST                       6,003,916
    GSS                         905,645
    HTG                          18,364
    HTC                          49,373
    STS                         117,870
    PAT                         953,269
  RefSeq                         35,934
  TPA                               893
  Total                       9,002,154

Other NCBI Databases
• dbSNP: nucleotide polymorphism
• GEO: Gene Expression Omnibus; microarray and other expression data
• Gene: gene records; unifies LocusLink and Microbial Genomes
• Structure: imported structures (PDB); Cn3D viewer, NCBI curation
• CDD: conserved domain database; protein families (COGs), single domains (PFAM, SMART, CD)

NCBI's SNP Database
• Primary database and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
• 24 species
• Over 15 million submissions
[Figure: a submitted SNP (hemochromatosis SNP) and its RefSNP record.]
RefSNP
• Non-redundant
• Computational analysis
• BLAST hits to genome, mRNA, protein and structure

Sequence Similarity Searching
Basic Local Alignment Search Tool (BLAST)
  Search type   Tool
  Text          PubMed
  Sequence      BLAST
  Structure     VAST

Pairwise Alignment Summary
Global
• Best score for aligning the full-length sequences
• Dynamic programming
• Algorithm: Needleman-Wunsch
• Table cells are allowed any score
Local
• Best score for aligning part of the sequences
• Dynamic programming
• Algorithm: Smith-Waterman
• Table cells never score below zero

Global vs Local Alignment
[Figure: Seq 1 and Seq 2 aligned end to end (global alignment) and over a shared region only (local alignment).]

Global Alignment: Align program (Lipman and Pearson)
[Figure: global alignment of the human and worm proteins; only the first and last rows are recoverable.]

  human  M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
         M S .. AA SG. . .A ... .
  worm   MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
  ...
  human  REQLEHI--------KTHELHL
         . .:: . : ...
  worm   QWKLEDLFNLDSSEYKEASINF

Local alignment of the same two sequences, in BLAST output format (the column spacing of the match lines was lost in extraction):

  Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
             +A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+
  Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

  Human:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
             GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L
  Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

  Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
             L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VA
  Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

  Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
             VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR
  Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

  Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
             L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA
  Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

  Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
             PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW
  Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

  Human: 424 LDAAMRPSFLQLREQLEHI 443
             D RP+F L+ +LE +
  Worm:  472 SDPDKRPTFETLQWKLEDL 492

Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/protein) of query and database:
  – DNA vs DNA
  – DNA translation vs Protein
  – Protein vs Protein
  – Protein vs DNA translation
  – DNA translation vs DNA translation
• WWW, standalone, and network clients

What BLAST tells you
• BLAST reports surprising alignments
  – Different than chance
• Assumptions
  – Random sequences
  – Constant composition
• Conclusions
  – Surprising similarities imply evolutionary homology
Evolutionary homology: descent from a common ancestor. It does not always imply similar function.

BLAST/FASTA variants for different searches
  Program          Query    Database  Comparison     Searching purpose
  blastn/fasta     DNA      DNA       DNA level      Homologous DNA
  blastp/fasta     Protein  Protein   Protein level  Homologous protein
  blastx/fastx     DNA      Protein   Protein level  New genes from DNA
  tblastn/tfasta   Protein  DNA       Protein level  New genes from peptide
  tblastx/tfastx   DNA      DNA       Protein level  New genes from DNA
BLAST Web site: http://www.ncbi.nlm.nih.gov/BLAST
FASTA Web sites: http://www2.ebi.ac.uk/fasta3/ or http://www.fasta.genome.ad.jp/

BLASTN Databases
  nr      GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
  htgs    High-throughput genomic sequences (draft)
  pat     Patented nucleotide sequences
  mito    Mitochondrial sequences
  vector  Vector subset of GenBank
  month   GenBank, EMBL, DDBJ, PDB from the last 30 days
  chrom   Contigs and chromosomes from RefSeq

BLASTP Databases
  nr         GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
  swissprot  SWISS-PROT
  pat        Patented protein sequences
  pdb        Protein Data Bank
  month      GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days

Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
Make a lookup table of words.

  WORD SIZE   Def.  Min.
  blastn      11    7
  megablast   28    12

11-mers (blastn):
  GTACTGGACAT
   TACTGGACATG
    ACTGGACATGG
     CTGGACATGGA
      TGGACATGGAC
       GGACATGGACC
        GACATGGACCC
         ACATGGACCCT
  . . .

28-mers (megablast):
  GTACTGGACATGGACCCTACAGGAACGT
      TGGACATGGACCCTACAGGAACGTATAC
          CATGGACCCTACAGGAACGTATACGTAA
  . . .
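The "make a lookup table of words" step can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it stores every overlapping word of the query in a dictionary and then scans a subject sequence for exact word hits. The function names and the short subject sequence are hypothetical.

```python
from collections import defaultdict


def build_word_table(query, word_size=11):
    """Map every overlapping word of the query to its start positions."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


def find_word_hits(table, subject, word_size=11):
    """Return (query_start, subject_start) pairs for exact word matches."""
    hits = []
    for s_start in range(len(subject) - word_size + 1):
        word = subject[s_start:s_start + word_size]
        for q_start in table.get(word, []):
            hits.append((q_start, s_start))
    return hits


query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
table = build_word_table(query, word_size=11)

# Hypothetical subject: the fifth query word flanked by unrelated bases.
subject = "TTT" + "TGGACATGGAC" + "AAA"
hits = find_word_hits(table, subject, word_size=11)
print(sorted(table)[:3])
print(hits)
```

Each hit is only a seed; in the full algorithm the seed is then extended in both directions and scored.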
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
• Word size = 3 (default)
• Word size can only be 2 or 3
• Make a lookup table of words

Words:
  GTQ
   TQI
    QIT
     ITV   neighborhood words: LTV, MTV, ISV, LSV, etc.
      TVE
       VED
        EDL
         DLF
  ...

Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

Nucleotide example (exact word match, one match):
  ATCGCCATGCTTAATTGGGCTT
       CATGCTTAATT

Protein example (neighborhood words, two matches):
  GTQITVEDLFYNI
  neighborhood words: SEI, YYN

BLAST Algorithm
(1) Break the query sequence into words of length W (default W = 11)
(2) Compare the word list to the database and identify exact matches
(3) For each word match, extend the alignment in both directions
(4) Compute the E-value

An alignment that BLAST can't find
(the column spacing of the match lines was lost in extraction)

    1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
      || | || || || | || || || || | ||| |||||| | | || | ||| |
    1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

   61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
      | || || || ||| || | |||||| || | |||||| ||||| | |
   61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

  121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
      |||| || ||||| || || | | |||| || |||
  121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Here there are no words longer than 6. For nucleotides there must be an exact match of at least 7.

An Alignment BLAST Can Make
Solution: compare protein sequences with BLASTX.
BLAST 2 Sequences (blastx) output:
  Score = 290 bits (741), Expect = 7e-77
  Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
  Frame = +3

Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana:

              aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
  H.sapiens:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
               +  +  V  +     V  L  G     Q  W  G  D  E  G
  A.thaliana:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
              agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

• BLASTn finds no match, because there are no 7 bp words
• BLASTp finds three matching words
Protein searches are generally more sensitive than nucleotide searches.
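The claim that the two ADSS coding fragments share no 7 bp word can be checked directly. The sketch below finds the longest exact word shared by the two nucleotide strings from the slide using a standard longest-common-substring dynamic program; it is only a check of the word-size argument, not a reimplementation of BLAST, and the function name is illustrative.

```python
def longest_shared_word(a, b):
    """Return the longest exact substring common to a and b.

    Classic dynamic programming: cur[j] is the length of the common
    suffix of a[:i] and b[:j].
    """
    best_word = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best_word):
                    best_word = a[i - cur[j]:i]
        prev = cur
    return best_word


human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
arabidopsis = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

word = longest_shared_word(human, arabidopsis)
minimum_blastn_word = 7  # minimum blastn word size given in the lecture
print(word, len(word))
print("seed possible:", len(word) >= minimum_blastn_word)
```

The longest shared word is shorter than the minimum blastn word size, so no seed exists at the nucleotide level, while the encoded peptides still share the conserved QWGDEG stretch.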
The Flavors of BLAST • Standard BLAST – traditional “contiguous” word hit – position independent scoring – nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx) • Megablast – optimized for large batch searches – can use discontiguous words • PSI-BLAST – constructs PSSMs automatically; uses as query – very sensitive protein search • RPS BLAST – searches a database of PSSMs – tool for conserved domain searches Megablast: NCBI’s Genome Annotator • • • • Long alignments for similar DNA sequences Concatenation of query sequences Faster than blastn Contiguous Megablast – exact word match – Word size 28 • Discontiguous Megablast – initial word hit with mismatches – cross-species comparison MegaBLAST >AI217550 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTG AI251192 GTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCT TTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGT AI254381 GACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCG BE645079 TCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAAC CACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC > 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGT TTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCT CCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAA GGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACA 
CCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAA AACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTC CTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAA GCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT end end C:\seq\hs.4.fsa end end Templates for Discontiguous Words W W W W W W W W W W W W = = = = = = = = = = = = 11, 11, 12, 12, 11, 11, 12, 12, 11, 11, 12, 12, t t t t t t t t t t t t = = = = = = = = = = = = 16, 16, 16, 16, 18, 18, 18, 18, 21, 21, 21, 21, coding: non-coding: coding: non-coding: coding: non-coding: coding: non-coding: coding: non-coding: coding: non-coding: 1101101101101101 1110010110110111 1111101101101101 1110110110110111 101101100101101101 111010010110010111 101101101101101101 111010110010110111 100101100101100101101 111010010100010010111 100101101101100101101 111010010110010010111 W = word size; # matches in template t = template length (window size within which the word match is evaluated) Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. 
Bioinformatics March, 2002; 18(3):440-5 Scoring Systems - Nucleotides Identity matrix A G C T A +1 –3 –3 –3 G –3 +1 –3 –3 CAGGTAGCAAGCTTGCATGTCA || |||||||||||| ||||| CACGTAGCAAGCTTG-GTGTCA C –3 –3 +1 –3 T -3 -3 -3 +1 raw score = 19-9 = 10 Scoring Systems - Proteins Position Independent Matrices PAM Matrices (Percent Accepted Mutation) • Derived from observation; small dataset of alignments • Implicit model of evolution • All calculated from PAM1 • PAM250 widely used BLOSUM Matrices (BLOck SUbstitution Matrices) • Derived from observation; large dataset of highly conserved blocks • Each matrix derived separately from blocks with a defined percent identity cutoff • BLOSUM62 - default matrix for BLAST Position Specific Score Matrices (PSSMs) PSI- and RPS-BLAST BLOSUM62 A 4 R -1 5 N -2 0 D -2 -2 C 0 -3 Q -1 1 E -1 0 G 0 -2 H -2 0 I -1 -3 L -1 -2 K -1 2 M -1 -1 F -2 -3 P -1 -2 S 1 -1 T 0 -1 Negative W -3 -3 Y -2 -2 V 0 -3 X 0 -1 A R 6 1 6 Common amino acids have low -3 -3 9 0 0 -3 5 0 2 -4 2 5 0 -1 -3 -2 -2 6 1 -1 -3 0 0 -2 8 -3 -3 -1 -3 -3 -4 -3 4 -3 -4 -1 -2 -3 -4 -3 2 4 Rare amino acids have high 0 -1 -3 1 1 -2 -1 -3 -2 5 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 for substitutions -4 less -4 -2likely -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 -1 -1 -2 for -1 more -1 -1 likely -1 -1substitutions -1 -1 -1 -1 -2 0 Positive N D C Q E G H I L K M F P S weights weights 5 -2 11 -2 2 7 0 -3 -1 4 0 -2 -1 -1 -1 T W Y V X Gapped Alignments • • • Gapping provides more biologically realistic alignments Statistical behavior is not completely understood for gapped alignments Gapped BLAST parameters must be found by simulations for each matrix Gap costs: -(a+bk) a = gap open penalty b = gap extend penalty k= number of residues For example: A gap of 1 residue receives the score 
“-(a+b)”.

Scores

Simply add the scores for each pair of aligned residues and (as necessary) factor in the gaps!

            V    D    S    –    C    Y
            V    E    T    L    C    F
BLOSUM62   +4   +2   +1  -12   +9   +3   = 7
PAM30      +7   +2    0  -10  +10   +2   = 11

Different matrices produce different scores!

Matrix differences

PAM                                                     BLOSUM
Built from global alignments                            Built from local alignments
Built from a small amount of data                       Built from a vast amount of data
Based on minimum replacement or maximum parsimony       Based on groups of related sequences counted as one
Better for finding global alignments, remote homologs   Better for finding local alignments
Higher PAM series means more divergence                 Lower BLOSUM series means more divergence

Matrices - Rules of thumb

Need different levels of sensitivity?
– Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
– Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution.

[Figure: extreme value distribution; axes: Alignments vs. Score]

Expect Value: E = number of database hits you expect to find by chance

E = K m n e^(-λS)
E = m n 2^(-S’)

E = expected number of random hits
m n = size of the search space (size of database)
S = your score
K = scale for search space
λ = scale for scoring system
S’ = bit score = (λS - ln K) / ln 2

(applies to ungapped alignments)

WWW BLAST

[Screenshot labels: The BLAST homepage; Standard databases; Specialized Databases]

BLAST Databases: Nucleic Acid
• nr (nt)
  – Traditional GenBank
  – NM_ and XM_ RefSeqs
• refseq_rna
• refseq_genomic
  – NC_ RefSeqs
• dbest
  – EST Division
• est_human, mouse, others
• htgs
  – HTG division
• gss
  – GSS division
• wgs
  – whole genome shotgun
• env_nt
  – environmental samples

Options for Advanced Blasting: Nucleotide

Example Entrez Queries (nucleotide)
• all[Filter] NOT mammalia[Organism]
• green plants[Organism]
• biomol_mrna[Properties]
• biomol_genomic[Properties]

Other Advanced
• -W 7       word size
• -e 10000   expect value
• -v 2000    descriptions
• -b 2000    alignments

BLAST Databases: Non-redundant protein

nr (non-redundant protein
sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein
  • PIR, Swiss-Prot, PRF
  • PDB (sequences from structures)

pat: protein patents
env_nr: environmental samples

Advanced Options: Filter

Example Entrez queries
• all[Filter] NOT mammals[Organism]
• gene_in_mitochondrion[Properties]
• 2003:2005 [Modification Date]
• tpa[Filter]

Nucleotide
• biomol_mrna[Properties]
• biomol_genomic[Properties]

Filter settings
• Default setting
• Hides low complexity for initial word hits only
• Masks regions of query in lower case (pre-masked)

BLAST Formatting Page

BLAST Output: Graphic
• Sort by taxonomy
• Mouse over

BLAST Output: Descriptions
• Sorted by e values (e.g. 3 × 10^-12)
• Link to Entrez
• Gene Linkout
• Default e value cutoff: 10

TaxBLAST: Taxonomy Reports

BLAST Output: Alignments

>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615

Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)

Query 9   LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL 58
          L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338

Legend: identical match; positive score (conservative); negative substitution; gap

BLAST Output: Alignments

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1
Length = 756

Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)

Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

Query: 61  GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395

Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406

Note: low complexity sequence filtered (shown as X in the query)

Neighbors: Precomputed BLAST
• Nucleotide
• Protein

Entrez Related Sequences produces a list of sequences sorted by BLAST
score, but with no alignment details.

Blink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant

PSI-BLAST

Position-Specific Iterated BLAST
• Mining for protein domains
• Confirming relationships among related proteins

Position-Specific Scoring Matrix (PSSM)

Pos  Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Annotations:
• Serine is scored differently in these two positions.
• Active site nucleophile

Position Specific Iterative BLAST: PSI-BLAST

Create your own PSSM: Finding protein families based on your own sequence.
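The position-specific scoring idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are copied from the slide's PSSM table as best they can be read (positions 211 and 214-219), and the names `PSSM`, `pssm_score` and `score_segment` are hypothetical helpers, not part of BLAST or any NCBI tool. It shows that serine scores 4 at position 211 but 7 at position 216, whereas a position-independent matrix such as BLOSUM62 gives S/S the same score (4) everywhere.

```python
# Column order used in the slide's PSSM table.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows read from the slide's PSSM (assumption: the extracted table was
# reassembled correctly). Keys are alignment positions.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}


def pssm_score(position, residue):
    """Score of placing `residue` at `position` (hypothetical helper)."""
    return PSSM[position][AMINO_ACIDS.index(residue)]


def score_segment(start, segment):
    """Sum of per-position scores for an ungapped segment starting at `start`."""
    return sum(pssm_score(start + i, aa) for i, aa in enumerate(segment))


if __name__ == "__main__":
    # Same residue, different positions, different scores.
    print("S at 211:", pssm_score(211, "S"))
    print("S at 216:", pssm_score(216, "S"))
    # Replacing the serine at 216 with alanine lowers the segment score.
    print("GDSGGP:", score_segment(214, "GDSGGP"))
    print("GDAGGP:", score_segment(214, "GDAGGP"))
```

A lookup keyed by position, rather than by residue pair, is what distinguishes a PSSM from BLOSUM or PAM: the penalty for a substitution depends on where in the family alignment it occurs.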
[Figure labels: query; PSSM; BLOSUM62; Alignment]

PSI-BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

[Screenshot label: e value cutoff for PSSM]

RESULTS: Initial BLASTP
• Same results as protein-protein BLAST

Results of First PSSM Search
• Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Third PSSM Search: Convergence
• Just below threshold, another nucleotide metabolism enzyme
• Check to add to PSSM

Reverse Position Specific Iterative-BLAST (a.k.a. RPS-BLAST or CDD Search)

A sequence search of the Conserved Domain Database (CDD) containing curated Position-Specific Scoring Matrices.

Row labels: consensus; 1FGI A; 1BYG A; gi 125135; gi 125702; gi 1174437 (each row starts at position 1)

        10        20        30        40        50        60
....*....|....*....|....*....|....*....|....*....|....*....|
KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL
aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL
RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI
GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL
KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL
KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL

PSSM Sources

Source   Maintainer   Count
Pfam     Sanger       7255
SMART    EMBL         663
COG      NCBI         4873
KOG      NCBI         4825
CD       NCBI         645

57 311 74 62 284 325

Reverse Position Specific Iterative-BLAST (a.k.a. RPS-BLAST or CD Search)
• Query: sequence P03958
• Database: PSSMs
• Result: TyrKc

Questions:
• Search for p53 protein homologs and annotate them with CDD.
• Can you place the codon 72 SNP in the 3D protein structure?

Other Areas to Cover
• Genomic Data Annotation
• Common Domains prediction
• WWW
• Other Useful Genome Browsers