Genome browsers and other resources

Purposes and features
• Browse genes in their genomic context
• See features in and around specific genes
• Investigate genome organization
• Search and retrieve information on a gene- and genome-scale
• Compare genomes and genome regions

The big guys
• NCBI – http://www.ncbi.nlm.nih.gov
• Ensembl – http://www.ensembl.org
• UCSC genome browser – http://genome.ucsc.edu
• Others of note – RepBase, KEGG, JGI Genome portal

There is absolutely no way I can cover everything that you can do while browsing any of these resources. I’ll just hit some highlights.

Every January, Nucleic Acids Research publishes a Database issue. The 2015 issue is 1274 pages long and has 172 manuscripts in 8 categories:
1. Nucleic acid sequence, structure and regulation
2. Protein sequence and structure, motifs and domains
3. Metabolic and signaling pathways, enzymes
4. Viruses, bacteria, protozoa and fungi
5. Human genome, model organisms, comparative genomics
6. Genomic variation, diseases and drugs
7. Plant databases
8. Other databases

Some from this year's issue:

1. Nucleic acid sequence, structure and regulation – highlights from 33 papers
• Database resources of the National Center for Biotechnology Information
• The European Bioinformatics Institute’s data resources 2014
• The DDBJ Japanese Genotype-phenotype Archive for genetic and phenotypic human data
• GenBank
• euL1db – the European database of L1HS retrotransposon insertions in humans
• ChiTaRS 2.1 – an improved database of chimeric transcripts….
• The Eukaryotic Promoter database
• tRFdb: a database for transfer RNA fragments
• miRDB: an online resource for microRNA target prediction and functional annotations
• lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs (+ four more lncRNA-related papers)
• NGSmethDB: an updated genome resource for high quality, single-cytosine resolution methylomes

2. Protein sequence and structure, motifs and domains – highlights from 33 papers
• UniProt: a hub for protein information
• The InterPro protein families database
• CDD: NCBI’s conserved domain database
• InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic
• MoonProt: a database for proteins that are known to moonlight
• REBASE – a database for DNA restriction and modification
• Genome3D: exploiting structure to help users understand their sequences

3. Metabolic and signaling pathways, enzymes – highlights from 17 papers
• STRING v10: protein-protein interaction networks, integrated over the tree of life
• EzCatDB: the enzyme reaction database
• ProteomeScout: a repository and analysis resource for post-translational modifications and proteins

4. Viruses, bacteria, protozoa and fungi – highlights from 14 papers
• HIV-1, human interaction database
• NCBI Viral Genomes Resource
• VirHostNet 2.0: surfing the web of virus/host interactions data
• Update on RefSeq microbial genomes resources
• GenoBase: comprehensive resource database of Escherichia coli K-12
• TrypanoCyc: a community-led biochemical pathways database for Trypanosoma brucei

5. Human genome, model organisms, comparative genomics – highlights from 16 papers
• Ensembl 2015
• The UCSC Genome Browser database: 2015 update
• Genomicus update 2015: KaryoView and MatrixView provide a genome-wide perspective to multispecies genomics
• FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations
• VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases
• SuperFly: a comparative database for quantified spatio-temporal gene expression patterns in early dipteran embryos
• DoGSD: the dog and wolf genome SNP database

6. Genomic variation, diseases and drugs – highlights from 29 papers
• OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders
• GRASP v2.0: an update on the Genome-Wide Repository of Associations between SNPs and Phenotypes
• COSMIC: exploring the world’s knowledge of somatic mutations in human cancer
• The UCSC Cancer Genomics Browser: update 2015
• Mouse Tumor Biology (MTB): a database of mouse models for human cancer
• BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal
• Cancer3D: understanding cancer mutations through protein structures
• (11 total related to cancer)
• EpilepsyGene: a genetic resource for genes and mutations related to epilepsy
• The Digital Aging Atlas: integrating the diversity of age-related changes into a unified resource

7. Plant databases – highlights from 10 papers
• PLAZA 3.0: an access point for plant comparative genomics
• PNRD: a plant non-coding RNA database
• AraNet v2: an improved database of co-functional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species
• Araport: the Arabidopsis Information Portal
• RiceVarMap: a comprehensive database of rice genomic variants
• The coffee genome hub: a resource for coffee genomes

8. Other databases – highlights from 12 papers
• Gene Ontology Consortium: going forward
• Genenames.org: the HGNC resources in 2015
• The Genomes OnLine Database (GOLD) v5: a metadata management system based on four level (meta) genome project classification
• diArk: the database for eukaryotic genome and transcriptome assemblies in 2014
• GeneFriends: a human RNA-seq-based gene and transcript co-expression database

NCBI – National Center for Biotechnology Information
• Major categories
  • DNA, RNA and protein
    • RefSeq – non-redundant set of curated and computationally predicted transcripts, proteins and genomic regions
    • GenBank – the primary nucleotide sequence archive
      • Subdivided into Nucleotide, EST, GSS and WGS
      • Also provides predicted translations of coding sequences
    • PopSet – related sequences and alignments from population, phylogenetic, mutation and ecosystem studies
    • Sequence Read Archive (SRA) – raw sequence reads and alignments generated by next-generation methods; the data that went into genome assemblies, GWAS, transcriptomes, etc.
    • Trace Archive – raw data from Sanger sequencing
    • BioSample – annotation of biological samples used in studies that ended up contributing data to other repositories
    • Protein clusters – sets of almost identical RefSeq proteins from multiple genomes
    • HIV-1/Human Protein Interaction Database – self-explanatory

NCBI
• Major categories
  • Sequence analysis
    • BLAST – sequence similarity searches of all types
      • blastn, blastp, blastx, tblastn, tblastx
      • Search nr, WGS, GSS, EST, etc. Can also limit by taxon, molecule, etc.
      • Multiple output formats to ease processing
      • Parsing of results possible based on E-value
    • Primer-BLAST – uses Primer3 for primer design

NCBI
• Major categories
  • Genes and Expression
    • Gene – curated sequences and descriptive information about genes, with links
    • RefSeqGene – stable, standard human genomic sequences with mRNAs for well-characterized human genes
    • Consensus CDS (CCDS) Database – human and mouse coding regions
    • Gene Expression Omnibus – repository for high-throughput data generated by next-gen and microarray methods
    • UniGene – transcript sequences
    • HomoloGene – detects homologs by comparison to 21 eukaryotes

NCBI
• Major categories
  • Genomes
    • BioProject – central access point for information on genome projects
    • Genome Reference Consortium – aims to produce assemblies of higher eukaryotic genomes that best reflect complex allelic diversity consistent with currently available data. Currently produces assemblies for human, mouse and zebrafish.
    • Clone Database (CloneDB) – information about available clones and libraries
    • Epigenomics – data from epigenetics studies
    • Influenza Genome Resources

NCBI
• Major categories
  • Genetics and Medicine
    • dbGaP – Database of Genotypes and Phenotypes; correlates genomic characteristics with observable traits
    • dbVar – structural variation (>50 bp) database
    • dbSNP – all types of short genetic variations (<50 bp); includes information about allele frequencies and genotypes
    • OMIA – Online Mendelian Inheritance in Animals; genes/traits associated with non-human animals
    • dbMHC, dbLRC, dbRBC – routine clinical applications, generally immunity related

NCBI
• Major categories
  • PubChem – focuses on the chemical, structural and biological properties of small molecules and their roles as diagnostic and therapeutic agents
  • Domains and Structures
    • Molecular Modeling Database – coordinate sets from a related protein database, augmented with domain annotations, literature, related sequences, and conserved domains in CDD (below)
    • Conserved Domain Database (CDD)

NCBI
• In class exercise – explore BLAST/NCBI using a query

>query
ACCACTTCTCCTGACATTCAGTTTGGTGAACAGGGTTTGCCTCCATCCCAAGAAGATGTCCTGCACTGACC
TGTGCTACCCATCGAGTGGGATCGCCTGCCCAAGGCCCTTTGCTGACAGCTGCAATGAGGCCTGTATCAGG
CAGTGCCCTGACTCGAGGGTAGTGATCCAGCCACCACCAGTTGTTGTGACCATCCCAGGCCCCATTCTCAG
CAACTGCCCTCAGGACAGCGTCGTGGGATCTGCAGGAGTACCCGCTGTGGGCCATGGAGCTGCTGGGGG
AACTGCCCTGTCTGGGGGCCCCAGTGGTCCTGGGGGACATCTTGGTTATGGAGGCCTATATGGGGGACTT
GGTGGTTATGGGGGCCTTGGTGGTTATGGGGGCCTTGGCGGTTATGGGGGCTTTGGTGGTTATGGGGGC
CTTGGCGGTTATGGGGGACTTGGTGGTTATGGGGGATCTCTTGGTTCTGGGGGCCTCTGTGGTTATGGA
GGATCTCTTGGTTCTGGGGGCCTCTGTGGTTATGGAGGATCTCTTGGTTATGGGGGCCTCTGTGGTTATG
GGGGCCTCAGCTCTGGTTCTGGGAGCTGTTACAGCTCTGGGTACTGCAGCCCTTATCCCTACCGTCGATAC
GGCAGGTACCGCTATGGAAGCTGCGGACCATGCTAAACCCAACAGGAAGTTCCACAGAAGCAGGAATCAAA
AGAAGATGAAGAATATGATCCAGATACTTGGCTGAGCTACTGAGCAATGGGCTTGAGAAGGTCTGAGCACC
CACAGCTCTAATGAAATCTAGTGGAGCTGCATCTGCTCAACACCTCCTAAATCTGTTGCCATGTTATATTTTA
CCTTAAAATTTCTGACCTTTTGTCTGCTTCCCCAATCCTGTTTGCCTTTGCTTCATCATGTTTATTACTCACT
GGGGTAAAAACTGTAACTTAAAATCTTCCCCAATGTCACTTGATTTTCCAACTCACAGCTTCAAAATAGAATT
CTGTGAAGAGTATCTCTGAGGTCAACAGCGGTCCCTCAGAGTTGAGGATAATGTCTGTCAGGCTTTATGGG
TCCTCAGGTGACTGAGGAGCCTGATTCTGGATACACACATTATTCAGCAATGTGGGCATGTTGTTGAACAA
GGGGGTGGGATTCACAGTCCTGGGTTGGTTTCCTTCTCCTTCATCCACTGCCACCATCCCGCCTCAGTATC
AAGTCATTGGGCCTCAAAGTGGTGTGCACTTTGGTGGACCAGGTGGCACCACACCGAATGGTCAGCAGCT
TGCTTCTCCCAGCTGTTGGTGCCAACGCCTCCGTGCTTGAGGGCAACCTTGAGTGTGTCTTTGTACCATT
TCCTTTGGCTGCCCTTTGACTCCTTTTTCCTAGGAGTAACTAACTGATTAAGCAGCTTACTGGTATTCTTTG
TACTCATCAACACTTTGCTCTGCATATGTATACTTCACCCTTTTCTTCCCATTAAAATTATGTTGCATTATGAAA
AAAAAAAAA

NCBI
• In class exercise – explore BLAST/NCBI using a query
  • Search nt database
    • Results page – search summary, taxonomy report, distance tree, E-value
    • Best hit page – source pub, tissue of origin, InterPro link, UniProt link, GO terms, pick primers, export options, Find in this sequence (start codon, Musashi binding site (AU(1-3)AGU), polyadenylation site (AATAAA)), ID contigs making up scaffold
  • Search WGS Genomes – limit to Crocodylia
    • Results page – search summary, taxonomy report, distance tree, E-value
    • Best hit page – GenBank link takes you to scaffold page, graphics takes you to genome browser
  • Try BlastX
    • ID conserved domains
  • http://www.ncbi.nlm.nih.gov/Class/FieldGuide/problem_set.html

NCBI
• In class exercise – explore BLAST/NCBI using a query

Use Entrez nucleotides to retrieve the finished record AC009453 from the human genome project.
How many times has it been updated since it first appeared? Trace the history all the way back to the first version. Based on the update date, when did this record first appear? How many unordered pieces were there then? Now use electronic PCR (linked as a "hotspot" on the NCBI homepage) to identify STS markers present in this record. How many are there? These include radiation hybrid and genetic markers. Notice that one of these markers is also a repeat polymorphism that is mapped on two human genetic maps (Marshfield and Genethon). Follow the links from the ePCR results to see which marker it is.

NCBI
• In class exercise – explore BLAST/NCBI using a query

Here is a sequence of DNA.

agtttttcacatatctccatcgcctcagttgctatcaaca

Use the NCBI database to identify the species from which it originated, as well as your best guess as to which gene it belongs to. Provide a screenshot of the blastn result. Interpret the results for me. Which hit is the ‘best’ hit? If there is no ‘best’ hit, how do you interpret these results?

Ensembl
• Soon after the publication of the human genome, it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data.
• Ensembl’s initial goals:
  • to automatically annotate genomes
  • integrate annotations with other available biological data
  • make all this publicly available via the web
• Expanded goals include comparative genomics, variation and regulatory data.
• There is absolutely no way I can cover everything that you can do while browsing Ensembl. I’ll just hit some highlights.

Ensembl
• Multiple ways to view particular genes/features
• Examples: UCP1 and BRCA2
• Just do a simple search for each and see what can be learned
• Find out:
  • What each gene does.
  • Where it’s located.
  • What splice variants exist.
  • What homologs exist in other taxa.
  • If there are known sequence variants.
  • What evidence supports its annotation.
  • Regulatory features

Ensembl
• Multiple ways to view particular genes/features
• Examples: UCP1 and BRCA2
  • Gene sequence view
    http://useast.ensembl.org/Homo_sapiens/Gene/Sequence?g=ENSG00000139618;r=13:32889611-32973805
  • Exons view
    http://useast.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000139618;r=13:32889611-32973805;t=ENST00000544455
  • cDNA view
    http://useast.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000139618;r=13:32889611-32973805;t=ENST00000544455

UCSC Genome Browser
• The genome browser zooms and scrolls over chromosomes/scaffolds and shows the work of annotators.
• Quickly maps your sequence to a genome. NOT a BLAST clone. Can quickly find sequences of 95% similarity and >40 bases.
• Provides access to the underlying database.
• Displays a sorted table of genes that are related (homology, expression profiles, proximity).
• Searches a sequence database with a pair of PCR primers. Returns a fasta file with all sequences in the database that lie between the primers.
• Genome-wide data sets (SNP studies, linkage studies, mapping studies).
• Online tools to manipulate large data sets.
• In situ images from frog and mouse.
• Liftover – genome coordinates converter. DNA duster removes formatting oddities from input sequences. Protein duster. Tree gif maker.
• Genome and annotation downloads for local searches and manipulations. Source codes for the browser, blat and liftover.
• Add your custom annotations to an existing genome for individualized analysis.
• Web-based tools to visualize, integrate and analyze cancer genomics and its associated clinical data.
• 300+ microbial species from Bacteria and Archaea. Basic gene annotation. Sequence conservation data, nucleotide and protein motifs, non-coding RNA predictions, operon predictions, gene expression data, high-throughput RNA sequencing.
• Search the multiple Neandertal genome assemblies.

UCSC Genome Browser
• In class exercise – explore the Genome Browser by examining genomic features of UCP1
  • Enter UCP1 in search box
  • ID mRNA, conservation, SNPs, TEs, drag track, zoom out and back in, base view, track control options, Neandertal reads, fosmid end pairs, ENCODE TF binding
• BLAT
  • From the UCP1 work, find the mRNA for UCP1
  • Shift to the microbat genome and submit to BLAT
  • Identify the best hit and explore the browser and details views

>Ves14#SINE/tRNA
GCCGGGACCGGTTTGGCTCAGTGGATAGAGCGTCGGCCTGGGGACTGGAAGGTCCCGGGTTCGATTCCGGTCAAGGGCATGTACCTTGGTTGCGGGCACATCCCCGGTGGGGGGTGT
GCGGAAGGCAGCCGGTCGATGTTTCTCTCTCATCGACGTTTCTAGCTCTCTATCTCTCTCCCTTCCTCCCTGTAAAAATCAATAAAATA

• In silico PCR
  • Search the M. lucifugus genome for these primers:
    • TTCCTCAAGGGGAATTGTCA / GAGTGTCTGCCTCCTTCCTG
    • GTCGCAGCATTCAGACTATAGTGATG / CGCTTTCCTTCGCCGCAATAAATTTCC
• Downloads

UCSC Genome Browser
• In class exercise – find all TEs present in the last 1,000,000 bp of chromosome 22 that are shared with other primates
  • Use the Table Browser
  • Region, position – enter chr22; it will default to the entire chromosome, so change it to the last 1,000,000 bp
  • Group – Repeats
  • Intersection – create; group = comparative genomics, track = primate chain, overlap with primate chain
  • Output format – hyperlinks
  • Try again for TEs NOT shared with other primates
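
For the Table Browser exercise, the position string for "the last 1,000,000 bp" depends on the chromosome length in the assembly you are browsing. The Python sketch below works out that string. The function name `last_window` is hypothetical, and the chr22 length used is an assumption that corresponds to the hg19 assembly; look up the length for whichever assembly you actually use. Coordinates are treated as 1-based and inclusive, as in the browser's position box.

```python
def last_window(chrom: str, chrom_length: int, window: int = 1_000_000) -> str:
    """Return a 1-based, inclusive 'chrom:start-end' string covering the
    last `window` base pairs of a chromosome of length `chrom_length`."""
    if chrom_length <= 0 or window <= 0:
        raise ValueError("chrom_length and window must be positive")
    # If the chromosome is shorter than the window, start at position 1.
    start = max(1, chrom_length - window + 1)
    return f"{chrom}:{start}-{chrom_length}"


# Assumption: chr22 length in the hg19 assembly. Other assemblies differ,
# so check the value for the assembly selected in the browser.
HG19_CHR22_LENGTH = 51_304_566

if __name__ == "__main__":
    print(last_window("chr22", HG19_CHR22_LENGTH))
```

Paste the printed string into the position field of the Table Browser before setting the group and intersection options.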