Genome browsers and other resources

Purposes and features
• Browse genes in their genomic context
• See features in and around specific genes
• Investigate genome organization
• Search and retrieve information at gene and genome scale
• Compare genomes and genome regions
The big guys
• NCBI – http://www.ncbi.nlm.nih.gov
• Ensembl – http://www.ensembl.org
• UCSC Genome Browser – http://genome.ucsc.edu
Others of note – RepBase, KEGG, JGI Genome Portal
There is absolutely no way I can cover everything that you can do while browsing any
of these resources. I’ll just hit some highlights.
Genome browsers and other resources
Every January, Nucleic Acids Research publishes a Database issue.
The 2015 issue is 1274 pages long and has 172 manuscripts in 8 categories:
1. Nucleic acid sequence, structure and regulation
2. Protein sequence and structure, motifs and domains
3. Metabolic and signaling pathways, enzymes
4. Viruses, bacteria, protozoa and fungi
5. Human genome, model organisms, comparative genomics
6. Genomic variation, diseases and drugs
7. Plant databases
8. Other databases
Genome browsers and other resources
Some from this year’s issue:
1. Nucleic acid sequence, structure and regulation – highlights from 33 papers
• Database resources of the National Center for Biotechnology Information
• The European Bioinformatics Institute’s data resources 2014
• The DDBJ Japanese Genotype-phenotype Archive for genetic and phenotypic human data
• GenBank
• euL1db – the European database of L1HS retrotransposon insertions in humans
• ChiTaRS 2.1 – an improved database of chimeric transcripts…
• The Eukaryotic Promoter Database
• tRFdb: a database for transfer RNA fragments
• miRDB: an online resource for microRNA target prediction and functional annotations
• lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs
• + four more lncRNA-related papers
• NGSmethDB: an updated genome resource for high quality, single-cytosine resolution methylomes
Genome browsers and other resources
Some from this year’s issue:
2. Protein sequence and structure, motifs and domains – highlights from 33 papers
• UniProt: a hub for protein information
• The InterPro protein families database
• CDD: NCBI’s conserved domain database
• InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic
• MoonProt: a database for proteins that are known to moonlight
• REBASE – a database for DNA restriction and modification
• Genome3D: exploiting structure to help users understand their sequences
Genome browsers and other resources
Some from this year’s issue:
3. Metabolic and signaling pathways, enzymes – highlights from 17 papers
• STRING v10: protein-protein interaction networks, integrated over the tree of life
• EzCatDB: the enzyme reaction database
• ProteomeScout: a repository and analysis resource for post-translational modifications and proteins
Genome browsers and other resources
Some from this year’s issue:
4. Viruses, bacteria, protozoa and fungi – highlights from 14 papers
• HIV-1, human interaction database
• NCBI Viral Genomes Resource
• VirHostNet 2.0: surfing the web of virus/host interactions data
• Update on RefSeq microbial genomes resources
• GenoBase: comprehensive resource database of Escherichia coli K-12
• TrypanoCyc: a community-led biochemical pathways database for Trypanosoma brucei
Genome browsers and other resources
Some from this year’s issue:
5. Human genome, model organisms, comparative genomics – highlights from 16 papers
• Ensembl 2015
• The UCSC Genome Browser database: 2015 update
• Genomicus update 2015: KaryoView and MatrixView provide a genome-wide perspective to multispecies genomics
• FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations
• VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases
• SuperFly: a comparative database for quantified spatio-temporal gene expression patterns in early dipteran embryos
• DoGSD: the dog and wolf genome SNP database
Genome browsers and other resources
Some from this year’s issue:
6. Genomic variation, diseases and drugs – highlights from 29 papers
• OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders
• GRASP v2.0: an update on the Genome-Wide Repository of Associations between SNPs and Phenotypes
• COSMIC: exploring the world’s knowledge of somatic mutations in human cancer
• The UCSC Cancer Genomics Browser: update 2015
• Mouse Tumor Biology (MTB): a database of mouse models for human cancer
• BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal
• Cancer3D: understanding cancer mutations through protein structures
• 11 papers in total related to cancer
• EpilepsyGene: a genetic resource for genes and mutations related to epilepsy
• The Digital Aging Atlas: integrating the diversity of age-related changes into a unified resource
Genome browsers and other resources
Some from this year’s issue:
7. Plant databases – highlights from 10 papers
• PLAZA 3.0: an access point for plant comparative genomics
• PNRD: a plant non-coding RNA database
• AraNet v2: an improved database of co-functional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species
• Araport: the Arabidopsis Information Portal
• RiceVarMap: a comprehensive database of rice genomic variants
• The coffee genome hub: a resource for coffee genomes
Genome browsers and other resources
Some from this year’s issue:
8. Other databases – highlights from 12 papers
• Gene Ontology Consortium: going forward
• Genenames.org: the HGNC resources in 2015
• The Genomes OnLine Database (GOLD) v5: a metadata management system based on four level (meta) genome project classification
• diArk: the database for eukaryotic genome and transcriptome assemblies in 2014
• GeneFriends: a human RNA-seq-based gene and transcript co-expression database
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• Major categories
• DNA, RNA and protein
• RefSeq – non-redundant set of curated and computationally predicted
transcripts, proteins and genomic regions
• GenBank – the primary nucleotide sequence archive
• Subdivided into Nucleotide, EST, GSS and WGS
• Also provides predicted translations of coding sequences
• PopSet – related sequences and alignments from population, phylogenetic,
mutation and ecosystem studies
• Sequence Read Archive (SRA) – raw sequence reads and alignments generated
by next generation methods, the data that went into genome assemblies,
GWAS, transcriptomes, etc.
• Trace Archive – raw data from Sanger sequencing
• BioSample – Annotation of biological samples used in studies that ended up
contributing data to other repositories
• Protein clusters – sets of almost identical RefSeq proteins from multiple
genomes
• HIV-1/Human Protein Interaction Database – self-explanatory
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• Major categories
• BLAST Sequence analysis
• BLAST – sequence similarity searches of all types
• blastn, blastp, blastx, tblastn, tblastx
• Search nr, WGS, GSS, EST, etc. Can also limit by taxon, molecule, etc.
• Multiple output formats to ease processing
• Parsing of results possible based on E-value
• Primer-blast – uses primer3 for primer design
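The point about parsing results by E-value can be sketched in a few lines. This is a minimal example, assuming results were saved in BLAST+ tabular format (`-outfmt 6`) with its default 12 columns. The function name `filter_hits`, the cutoff of 1e-10 and the example rows are made up for illustration.

```python
import csv
import io

# Column names of BLAST+ tabular output (-outfmt 6), in their default order.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(handle, max_evalue=1e-10):
    """Return hits (as dicts) whose E-value is at or below max_evalue."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits

# Made-up example rows, not real search results.
EXAMPLE = (
    "query\tsubjA\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-95\t350\n"
    "query\tsubjB\t80.0\t50\t10\t0\t5\t54\t100\t149\t0.002\t40.1\n"
)

for hit in filter_hits(io.StringIO(EXAMPLE)):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

The cutoff is a judgment call: a stricter value keeps only near-certain matches, a looser one lets through weak or chance hits.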
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• Major categories
• Genes and Expression
• Gene – curated sequences and descriptive information about genes with links
• RefSeqGene – stable, standard human genomic sequences with mRNAs for well-characterized human genes
• Conserved CDS Database – human and mouse coding regions
• Gene Expression Omnibus – repository for high-throughput data generated by next-gen and microarray methods
• UniGene – transcript sequences
• HomoloGene – detects homologs by comparison to 21 eukaryotes
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• Major categories
• Genomes
• BioProject – central access point for information on genome projects
• Genome Reference Consortium - aims to produce assemblies of higher
eukaryotic genomes that best reflect complex allelic diversity consistent with
currently available data. Currently produces assemblies for human, mouse and
zebrafish.
• Clone Database (CloneDB) – information about available clones and libraries
• Epigenomics – data from epigenetics studies
• Influenza Genome Resources
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• Major categories
• Genetics and Medicine
• dbGaP – Database of Genotypes and Phenotypes, correlates genomic
characteristics with observable traits
• dbVar – structural variation (>50 bp) database
• dbSNP – All types of short genetic variations (<50 bp), includes information
about allele frequencies and genotypes
• OMIA – Online Mendelian Inheritance in Animals, genes/traits associated with
non-human animals
• dbMHC, dbLRC, dbRBC – routine clinical applications, generally immunity
related
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• Major categories
• PubChem – focuses on the chemical, structural and biological properties of small molecules and their roles as diagnostic and therapeutic agents
• Domains and Structures
• Molecular Modeling Database – coordinate sets from the Protein Data Bank (PDB), augmented with domain annotations, literature, related sequences, and conserved domains in CDD (below)
• Conserved Domain Database (CDD)
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• In class exercise – explore BLAST/NCBI using a query
>query
ACCACTTCTCCTGACATTCAGTTTGGTGAACAGGGTTTGCCTCCATCCCAAGAAGATGTCCTGCACTGACC
TGTGCTACCCATCGAGTGGGATCGCCTGCCCAAGGCCCTTTGCTGACAGCTGCAATGAGGCCTGTATCAGG
CAGTGCCCTGACTCGAGGGTAGTGATCCAGCCACCACCAGTTGTTGTGACCATCCCAGGCCCCATTCTCAG
CAACTGCCCTCAGGACAGCGTCGTGGGATCTGCAGGAGTACCCGCTGTGGGCCATGGAGCTGCTGGGGG
AACTGCCCTGTCTGGGGGCCCCAGTGGTCCTGGGGGACATCTTGGTTATGGAGGCCTATATGGGGGACTT
GGTGGTTATGGGGGCCTTGGTGGTTATGGGGGCCTTGGCGGTTATGGGGGCTTTGGTGGTTATGGGGGC
CTTGGCGGTTATGGGGGACTTGGTGGTTATGGGGGATCTCTTGGTTCTGGGGGCCTCTGTGGTTATGGA
GGATCTCTTGGTTCTGGGGGCCTCTGTGGTTATGGAGGATCTCTTGGTTATGGGGGCCTCTGTGGTTATG
GGGGCCTCAGCTCTGGTTCTGGGAGCTGTTACAGCTCTGGGTACTGCAGCCCTTATCCCTACCGTCGATAC
GGCAGGTACCGCTATGGAAGCTGCGGACCATGCTAAACCCAACAGGAAGTTCCACAGAAGCAGGAATCAAA
AGAAGATGAAGAATATGATCCAGATACTTGGCTGAGCTACTGAGCAATGGGCTTGAGAAGGTCTGAGCACC
CACAGCTCTAATGAAATCTAGTGGAGCTGCATCTGCTCAACACCTCCTAAATCTGTTGCCATGTTATATTTTA
CCTTAAAATTTCTGACCTTTTGTCTGCTTCCCCAATCCTGTTTGCCTTTGCTTCATCATGTTTATTACTCACT
GGGGTAAAAACTGTAACTTAAAATCTTCCCCAATGTCACTTGATTTTCCAACTCACAGCTTCAAAATAGAATT
CTGTGAAGAGTATCTCTGAGGTCAACAGCGGTCCCTCAGAGTTGAGGATAATGTCTGTCAGGCTTTATGGG
TCCTCAGGTGACTGAGGAGCCTGATTCTGGATACACACATTATTCAGCAATGTGGGCATGTTGTTGAACAA
GGGGGTGGGATTCACAGTCCTGGGTTGGTTTCCTTCTCCTTCATCCACTGCCACCATCCCGCCTCAGTATC
AAGTCATTGGGCCTCAAAGTGGTGTGCACTTTGGTGGACCAGGTGGCACCACACCGAATGGTCAGCAGCT
TGCTTCTCCCAGCTGTTGGTGCCAACGCCTCCGTGCTTGAGGGCAACCTTGAGTGTGTCTTTGTACCATT
TCCTTTGGCTGCCCTTTGACTCCTTTTTCCTAGGAGTAACTAACTGATTAAGCAGCTTACTGGTATTCTTTG
TACTCATCAACACTTTGCTCTGCATATGTATACTTCACCCTTTTCTTCCCATTAAAATTATGTTGCATTATGAAA
AAAAAAAAA
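Before pasting a query into BLAST, a quick check of the FASTA record can help. The sketch below reads FASTA text and reports sequence length and GC content. The helper names `read_fasta` and `gc_content` are made up, and the record is a shortened toy version of the query so the example stays small.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header line, without the ">"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Toy record: only the first 20 bases of the query, split over two lines.
TOY = ">query\nACCACTTCTC\nCTGACATTCA\n"

seqs = read_fasta(TOY)
print(len(seqs["query"]), gc_content(seqs["query"]))
```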
Genome browsers and other resources
NCBI – National Center for Biotechnology Information
• In class exercise – explore BLAST/NCBI using a query
• Search nt database
• Results page – search summary, taxonomy report, distance tree, E-value
• Best hit page – source publication, tissue of origin, InterPro link, UniProt link, GO terms, pick primers, export options, Find in this sequence (start codon, Musashi binding site (AU(1-3)AGU), polyadenylation site (AATAAA)), ID contigs making up scaffold
• Search WGS Genomes – limit to Crocodylia
• Results page – search summary, taxonomy report, distance tree, E-value
• Best hit page – GenBank link takes you to scaffold page, graphics takes you to genome browser
• Try BlastX
• ID conserved domains
• http://www.ncbi.nlm.nih.gov/Class/FieldGuide/problem_set.html
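The "Find in this sequence" searches can also be run locally with regular expressions. This is a minimal sketch that looks for the three motifs named in the exercise, written in the DNA alphabet (T instead of U). The function name `find_motifs` and the toy sequence are made up, and only the given strand is searched.

```python
import re

# DNA-alphabet versions of the motifs named in the exercise.
MOTIFS = {
    "start_codon": "ATG",
    "musashi_site": "AT{1,3}AGT",   # AU(1-3)AGU written with T instead of U
    "polyA_signal": "AATAAA",
}

def find_motifs(seq, motifs=MOTIFS):
    """Return {motif name: [1-based start positions]} on the given strand."""
    seq = seq.upper()
    found = {}
    for name, pattern in motifs.items():
        # A lookahead lets overlapping occurrences be reported too.
        regex = re.compile("(?=(" + pattern + "))")
        found[name] = [m.start() + 1 for m in regex.finditer(seq)]
    return found

# Toy sequence built to contain one copy of each motif.
TOY_SEQ = "CCATGGATTAGTCCAATAAAGG"
print(find_motifs(TOY_SEQ))
```

An ATG found this way is only a candidate start codon. Whether it begins the real open reading frame still has to be judged from the annotation or a BlastX alignment.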
Genome browsers and other resources

NCBI – National Center for Biotechnology Information
• In class exercise – explore BLAST/NCBI using a query
• Use Entrez Nucleotide to retrieve the finished record AC009453 from the human genome project.
• How many times has it been updated since it first appeared?
• Trace the history all the way back to the first version. Based on the update date, when did this record first appear? How many unordered pieces were there then?
• Now use electronic PCR (linked as a "hotspot" on the NCBI homepage) to identify STS markers present in this record. How many are there? These include radiation hybrid and genetic markers. Notice that one of these markers is also a repeat polymorphism that is mapped on two human genetic maps (Marshfield and Genethon). Follow the links from the ePCR results to see which marker it is.
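The record can also be requested without the web interface, through NCBI's E-utilities. The sketch below only builds an `efetch` request URL for the accession and makes no network call. The helper name `efetch_url` is made up, and the parameter values shown (`db=nuccore`, `rettype=gb`, `retmode=text`) are one reasonable choice for a GenBank flat file, not the only one.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build (but do not send) an efetch request for one nucleotide record."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return EUTILS_EFETCH + "?" + urlencode(params)

print(efetch_url("AC009453"))
```

This asks for the current version only. The revision history in the exercise still has to be traced from the record page itself.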
Genome browsers and other resources

NCBI – National Center for Biotechnology Information
• In class exercise – explore BLAST/NCBI using a query
• Here is a sequence of DNA: agtttttcacatatctccatcgcctcagttgctatcaaca
• Use the NCBI database to identify the species from which it originated, as well as your best guess as to which gene it belongs to. Provide a screenshot of the blastn result.
• Interpret the results for me. Which hit is the ‘best’ hit? If there is no ‘best’ hit, how do you interpret these results?
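One way to think about the 'best hit' question: a short query can match several database entries equally well, so it helps to check for ties before naming a winner. The sketch below ranks hits by E-value, then bit score, and returns every hit tied for first place. The function name `best_hits` and the hits themselves are invented for illustration.

```python
def best_hits(hits):
    """Return every hit that shares the lowest E-value and highest bit score.

    hits is a list of (subject, evalue, bitscore) tuples.
    """
    if not hits:
        return []
    # Lowest E-value first; among equal E-values, highest bit score first.
    ranked = sorted(hits, key=lambda h: (h[1], -h[2]))
    top = ranked[0]
    return [h for h in ranked if h[1] == top[1] and h[2] == top[2]]

# Hypothetical hits for a short query; names and scores are invented.
EXAMPLE_HITS = [
    ("species_A gene_X", 2e-12, 75.0),
    ("species_B gene_X", 2e-12, 75.0),
    ("species_C gene_Y", 0.5, 30.2),
]

tied = best_hits(EXAMPLE_HITS)
print(len(tied), "hit(s) share the top score")
```

If more than one hit comes back, the scores alone cannot single out a species. The honest answer is the group of sequences that tie.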
Genome browsers and other resources
Ensembl
• Soon after the publication of the human genome, it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data.
• Ensembl’s initial goals:
• to automatically annotate genomes
• integrate annotations with other available biological data
• make all this publicly available via the web
• Expanded goals include comparative genomics, variation and regulatory data.
• There is absolutely no way I can cover everything that you can do while browsing Ensembl. I’ll just hit some highlights.
Genome browsers and other resources
Ensembl
• Multiple ways to view particular genes/features
• Examples: UCP1 and BRCA2
• Just do a simple search for each and see what can be learned
• Find out:
• What each gene does
• Where it’s located
• What splice variants exist
• What homologs exist in other taxa
• If there are known sequence variants
• What evidence supports its annotation
• What regulatory features it has
Genome browsers and other resources
Ensembl
• Multiple ways to view particular genes/features
• Examples: UCP1 and BRCA2
• Gene sequence view: http://useast.ensembl.org/Homo_sapiens/Gene/Sequence?g=ENSG00000139618;r=13:32889611-32973805
• Exons view: http://useast.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000139618;r=13:32889611-32973805;t=ENST00000544455
• cDNA view: http://useast.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000139618;r=13:32889611-32973805;t=ENST00000544455
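The Ensembl view URLs carry the gene (`g`), region (`r`) and transcript (`t`) as semicolon-separated parameters, so they are easy to take apart. The sketch below splits one of the URLs above and works out the length of the region, treating the coordinates as 1-based and inclusive. The helper names `parse_ensembl_url` and `parse_region` are made up for illustration.

```python
from urllib.parse import urlsplit

def parse_ensembl_url(url):
    """Split an Ensembl view URL into its path and ';'-separated parameters."""
    parts = urlsplit(url)
    params = dict(
        item.split("=", 1) for item in parts.query.split(";") if "=" in item
    )
    return parts.path, params

def parse_region(region):
    """Turn 'chrom:start-end' into (chrom, start, end, length in bp)."""
    chrom, span = region.split(":")
    start, end = (int(x) for x in span.split("-"))
    return chrom, start, end, end - start + 1  # inclusive coordinates

GENE_URL = ("http://useast.ensembl.org/Homo_sapiens/Gene/Sequence"
            "?g=ENSG00000139618;r=13:32889611-32973805")

path, params = parse_ensembl_url(GENE_URL)
print(path, params["g"], parse_region(params["r"]))
```

Swapping the path (Gene/Sequence, Transcript/Exons, Transcript/Sequence_cDNA) while keeping the same parameters is what moves between the three views listed above.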
Genome browsers and other resources

UCSC Genome Browser
• The genome browser zooms and scrolls over chromosomes/scaffolds and shows the work of annotators
• Quickly maps your sequence to a genome. NOT a BLAST clone. Can quickly find sequences of 95% similarity and >40 bases
• Provides access to the underlying database
• Displays a sorted table of genes that are related (homology, expression profiles, proximity)
• Searches a sequence database with a pair of PCR primers. Returns a fasta file with all sequences in the database that lie between the primers
• Genome-wide data sets (SNP studies, linkage studies, mapping studies)
• Online tools to manipulate large data sets
• In situ images from frog and mouse
• Liftover – genome coordinates converter. DNA duster removes formatting oddities from input sequences. Protein duster. Tree gif maker.
• Genome and annotation downloads for local searches and manipulations
• Add your custom annotations to an existing genome for individualized analysis
• Source codes for the browser, blat and liftover
• Web-based tools to visualize, integrate and analyze cancer genomics and its associated clinical data
• 300+ microbial species from Bacteria and Archaea. Basic gene annotation. Sequence conservation data, nucleotide and protein motifs, non-coding RNA predictions, operon predictions, gene expression data, high-throughput RNA sequencing.
• Search the multiple Neandertal genome assemblies
Genome browsers and other resources

UCSC Genome Browser
• In class exercise – explore Genome Browser by examining genomic features of UCP1
• Enter UCP1 in search box
• ID mRNA, conservation, SNPs, TEs
• Drag track, zoom out and back in, base view
• Track control options, Neandertal reads, fosmid end pairs, ENCODE TF binding
• From UCP work, find mRNA for UCP1
• Shift to microbat genome and submit to BLAT
• Identify best hit and explore browser and details views
• BLAT
>Ves14#SINE/tRNA
GCCGGGACCGGTTTGGCTCAGTGGATAGAGCGTCGGCCTGGGGACTGGAAGGTCCCGGGTTCGATTCCGGTCAAGGGCATGTACCTTGGTTGCGGGCACATCCCCGGTGGGGGGTGT
GCGGAAGGCAGCCGGTCGATGTTTCTCTCTCATCGACGTTTCTAGCTCTCTATCTCTCTCCCTTCCTCCCTGTAAAAATCAATAAAATA
• In silico PCR – search M. lucifugus genome for these primers:
• TTCCTCAAGGGGAATTGTCA / GAGTGTCTGCCTCCTTCCTG
• GTCGCAGCATTCAGACTATAGTGATG / CGCTTTCCTTCGCCGCAATAAATTTCC
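The idea behind in silico PCR can be sketched as a plain string search: find the forward primer, then find the reverse complement of the reverse primer downstream, and report what lies between them, primers included. This is a simplified illustration, not how the UCSC tool is implemented. It assumes exact matches, searches only the given strand, and the function names, the `max_size` limit and the toy template are all made up.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def in_silico_pcr(template, forward, reverse, max_size=4000):
    """Return amplicons (primers included) found on the given strand.

    Exact matches only; the reverse primer is looked for as its
    reverse complement, downstream of the forward primer.
    """
    template = template.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    products = []
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        if end != -1:
            amplicon = template[start:end + len(reverse_site)]
            if len(amplicon) <= max_size:
                products.append(amplicon)
        start = template.find(forward, start + 1)
    return products

# Toy template: flank + forward primer + insert + reverse site + flank.
TOY_TEMPLATE = "GGGG" + "ACGTAC" + "TTTTTTTT" + "GATTCA" + "CCCC"
print(in_silico_pcr(TOY_TEMPLATE, "ACGTAC", "TGAATC"))
```

A real search would also try the opposite orientation and tolerate a few mismatches, which is why the genome-wide tool is the better choice for the primer pairs above.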
Downloads
Genome browsers and other resources

UCSC Genome Browser
• In class exercise – find all TEs present in the last 1,000,000 bp of chromosome 22 that are shared with other primates
• Use Table Browser
• Region, position – enter chr22, will default to entire chromosome, change to last 1,000,000 bp
• Group – Repeats
• Intersection – create, group = comparative genomics, track = primate chain, overlap with primate chain
• Output format – hyperlinks
• Try again for TEs NOT shared with other primates
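What the Table Browser intersection does can be sketched with simple interval logic: restrict to the last 1,000,000 bp, then split the repeats into those that overlap a chain and those that do not. This is a toy illustration using 0-based, half-open coordinates. The function names are made up, and the chromosome length and all intervals are placeholders, not real chr22 values.

```python
def last_window(chrom_length, window=1_000_000):
    """0-based, half-open (start, end) of the final `window` bases."""
    return max(0, chrom_length - window), chrom_length

def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def split_by_overlap(features, others):
    """Partition features into (overlapping any of others, overlapping none)."""
    shared, unique = [], []
    for f in features:
        (shared if any(overlaps(f, o) for o in others) else unique).append(f)
    return shared, unique

# Placeholder chromosome length and toy intervals, not real assembly data.
REGION = last_window(5_000_000)
REPEATS = [(4_100_000, 4_100_300), (4_500_000, 4_500_280), (3_000_000, 3_000_100)]
CHAINS = [(4_099_000, 4_200_000)]

in_region = [r for r in REPEATS if overlaps(r, REGION)]
shared, unique = split_by_overlap(in_region, CHAINS)
print(REGION, shared, unique)
```

The two lists correspond to the two runs of the exercise: TEs shared with other primates, and TEs that are not.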