Accessing and visualizing genomics data
Jim Noonan
GENE 760

A working definition of genomics
The global study of how biological information is encoded in genome sequence
• Genes
• Regulatory sequences
• Genetic variation
How this information is read out to produce distinct biological outcomes
• Gene expression and regulation
• Cellular identity, differentiation and development
• Phenotypic variation among individuals and species

Genomes are vast information repositories
Human: 3 Gb
• kb = 1,000 bp
• Mb = 1 x 10^6 bp
• Gb = 1 x 10^9 bp
• Tb = 1 x 10^12 bp
• Pb = 1 x 10^15 bp
[Figure: genome sizes on a scale marked 1 Gb, 10 Gb, 100 Gb]

Reference genomes

Obtaining a usable reference genome
1. Generating primary genome sequence: Accurate sequence from a single individual assembled into a usable form (ideally to chromosomes)
2. Annotating basic functions: Gene locations, repeat content, etc.
3. Delivery and access:
• Provide a simple, interactive interface for accessing data
• Tools for integrating private experimental data with public annotation
• Means to annotate genetic variation

Sequencing the reference human genome (1990-present; ‘finished’ 2003)
• Industrialization of Sanger sequencing, library construction, sample preparation, analysis, etc.
• $3 billion total cost
• 1 Gb/month at largest centers (2005)
• YCGA = 9.6 Tb per month (2011)

Genome assembly and annotation
[Figure: >>10^9 sequencing reads (36 bp - 1 kb); 3 Gb genome]

Genome assembly
Assembly quality criteria:
• Accuracy: number of errors (Human << 1/100,000 bp)
• Contiguity: number of gaps (Human: est. 357)
• Coverage: Average number of reads representing a particular position in the assembly
  Human, Mouse, Rat: > 20x
  Chimpanzee: ~6x
  Squirrel: ~2x
Assembly steps:
• Generate reads
• Find overlapping reads
• Assemble reads into contigs
• Join contigs into scaffolds (Scaffold_0: 12,865,123 – 12,965,110)
• Join scaffolds into “finished” sequence anchored on chromosomes (Chr5: 133,876,119 – 134,876,119)
[Figure labels: contig, mate pair, scaffold]
[Figure: raw sequence of Chr5: 133,876,119 – 134,876,119, shown as a full slide of A/C/G/T text]

Genome annotation
[Figure: ~3 billion bp of raw sequence (ACAATAAATCACATTAATTCC…), annotated with the categories below]
Genes:
- Coding, noncoding, miRNA, etc.
- Isoforms
- Expression
Genetic variation:
- SNPs and CNVs
Sequence conservation
Regulatory sequences:
- Promoters
- Enhancers
- Insulators
Epigenetics:
- DNA methylation
- Chromatin

Density of biological information in the human genome
[Figure: genome browser view of Chr5: 133,876,119 – 134,876,119 with tracks: Genes, Transcription, Histone mods, TF binding, Mouse orthology, SNPs, Repeats]

Annotation depth varies by species
Human, Mouse (Fly, Worm, Yeast):
- Chromosome assemblies
- Dense gene and regulatory maps, variation, etc.
Other models (Dog, Chicken, Zebrafish):
- Chromosome assemblies
- Partial gene maps; variation; little regulatory data
Low coverage vertebrate genomes:
- Scaffold assemblies
- Few annotated genes
- Used for comparative purposes

Accessing genomic data
Pre-packaged, public data
• Genome assemblies
• Gene models
• Expression, regulation, variation (ENCODE, Epigenome)
• Evolutionary conservation
• Someone else (usually a consortium) has generated the data, done QC, and done rudimentary analysis
• Visualized via genome browsers, for the most part
Private data
• Generated by you or someone else
• Unannotated genome sequence
• RNA-seq, ChIP-seq, Exome seq datasets, etc.
• Raw data publicly available via download

Portals to access and interpret genomes
UCSC Genome Browser (genome.ucsc.edu): Visualization, data recovery, simple analysis (also genome-preview.ucsc.edu)
ENSEMBL (ensembl.org): Visualization, data recovery, simple analysis
Integrative Genomics Viewer (broadinstitute.org/software/igv/): Local genome viewer (visualize local and remote data)

UCSC Genome Browser
genome.ucsc.edu
Wiki Page: genomewiki.ucsc.edu
Read the User Guide

Human genome main page (Feb 2009 assembly)
There are multiple assemblies for many genomes!
Different genome assemblies have different coordinate systems and may have different annotations:
chr2:236,438,403-236,438,948 in March 2006 (hg18) is chr2:236,773,664-236,774,209 in Feb 2009 (hg19)

Genome Viewer
Categories of data: displayed as tracks
Discrete intervals (genes) or continuous (transcription)

Category: Genes and Gene Prediction
Hyperlinks and tabs for individual tracks
• Go to track description page
• Hide or show data in genome viewer
Some tracks include multiple datasets (‘subtracks’)
• Go to track description page to select
Different assemblies have different annotations!

Sample Genome Viewer image: PITX1
[Figure: browser tracks shown: Base position, Gene model (discrete), Transcription (continuous), TF binding, SNPs, Repeats]

Which gene annotation to use?
Gene description page and links to other resources

‘Layered’ tracks: Transcription
Display options
Subtracks

Integrating different types of annotation data
[Figure labels: Proximal enhancer, Promoter]

Downloading primary annotation data
Wiki Page: genomewiki.ucsc.edu
All available via FTP

Common Genome Browser file formats
BED format
• For interval data (e.g., exons)
• Tab-delimited format: chr start stop identifier
• BED coordinates are ‘zero-based, half-open’: http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
• Position coordinates on the browser are 1-based. This leads to confusion if you are not careful.
• chr16 80372593 80373755 is shown in the browser as chr16:80372594-80373755
• BEDTools: utilities for comparing genomic features; you will use them on your problem sets
WIG format
• For continuous data (e.g., the Transcriptome track mentioned earlier)
• WIG files are very large! BigWig is an alternative format you will learn about in discussion.

The Table Browser (under Tools)
Select datasets
Compare datasets
Download data

Integrating your own experimental data
[Figure labels: Proximal enhancer, Promoter]
Mapping binding sites for a transcription factor of interest

Custom tracks and sessions
• Display and share your own data on the browser
• Custom tracks can be intersected, etc. in the Table Browser

Track Hubs (under My Data)
Integrating Track Hub data with your own experimental data

Genome Browser utilities: BLAT (under Tools)
• Rapidly find sequence locations in an assembly
• Finds DNA sequences >24 bp and at least 95% identical to the target genome

Assembly quality and annotation vary across genomes
Assembly not anchored to chromosomes
Poor gene annotation
Assembly quality metrics
Whole-genome alignment to mouse

Genome Browser utilities: LiftOver (under Tools)
• Convert coordinates from one assembly to another (e.g., hg18 to hg19)
• Identify orthologous positions between genomes (e.g., human to mouse)

Ensembl portal
ensembl.org
Ensembl genome browser

Accessing raw data: GEO
GEO: Functional genomics data (RNA-seq, ChIP-seq, etc.)
Variation data: dbSNP, dbGaP
[Screenshot labels: Metadata; Primary data files]

Wrap-up
Problem Set #1: Learn how to access and manipulate genomic datasets
Next lecture: High-throughput sequencing technologies
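
Worked example: BED vs. browser coordinates
The BED format slide notes that BED intervals are zero-based and half-open while browser positions are 1-based. The sketch below illustrates that conversion using the chr16 example from that slide. The names Interval, bed_to_browser, browser_to_bed and bed_length are hypothetical helpers written for this illustration; they are not part of BEDTools or the UCSC tools. The parser assumes simple chromosome names (letters, digits, underscores) and does not validate coordinates against an assembly.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    """A BED-style interval (hypothetical helper type for this sketch)."""
    chrom: str
    start: int  # zero-based, inclusive
    end: int    # exclusive ("half-open")


def bed_to_browser(iv: Interval) -> str:
    """Format a BED interval as a 1-based browser position string."""
    # Only the start shifts by one; the half-open end already equals
    # the 1-based position of the last base.
    return f"{iv.chrom}:{iv.start + 1}-{iv.end}"


def browser_to_bed(position: str) -> Interval:
    """Parse 'chrom:start-end' (1-based; commas allowed) into a BED interval."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom, start, end = match.groups()
    return Interval(chrom, int(start.replace(",", "")) - 1, int(end.replace(",", "")))


def bed_length(iv: Interval) -> int:
    """Number of bases covered; with half-open coordinates this is end - start."""
    return iv.end - iv.start


# The chr16 example from the BED format slide.
exon = Interval("chr16", 80372593, 80373755)
print(bed_to_browser(exon))  # chr16:80372594-80373755
print(bed_length(exon))      # 1162

# The hg18 and hg19 positions from the assemblies slide: different
# coordinates, same interval length.
hg18 = browser_to_bed("chr2:236,438,403-236,438,948")
hg19 = browser_to_bed("chr2:236,773,664-236,774,209")
print(bed_length(hg18), bed_length(hg19))  # 546 546
```

With half-open coordinates the interval length is simply end minus start, with no "+1" correction. Note that moving between assemblies (hg18 to hg19) still requires LiftOver; the last lines only show that the two positions quoted on the slide span the same number of bases.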