Lecture 2. Accessing and Visualizing Genomics Data

Accessing and visualizing genomics data
Jim Noonan
GENE 760
A working definition of genomics
The global study of how biological information is encoded in
genome sequence
• Genes
• Regulatory sequences
• Genetic variation
How this information is read out to produce distinct biological
outcomes
• Gene expression and regulation
• Cellular identity, differentiation and development
• Phenotypic variation among individuals and species
Genomes are vast information repositories
Human 3 Gb
• kb = 1,000 bp
• Mb = 1 x 10^6 bp
• Gb = 1 x 10^9 bp
• Tb = 1 x 10^12 bp
• Pb = 1 x 10^15 bp
[Figure: genome size scale, 1 Gb to 100 Gb]
Reference genomes
Obtaining a usable reference genome
1. Generating primary genome sequence:
Accurate sequence from a single individual assembled
into a usable form (ideally to chromosomes)
2. Annotating basic functions:
Gene locations, repeat content, etc.
3. Delivery and access:
Provide a simple, interactive interface for accessing data
Tools for integrating private experimental data with public annotation
Means to annotate genetic variation
Sequencing the reference human genome
(1990-present; ‘finished’ 2003)
• Industrialization of Sanger sequencing, library construction, sample preparation, analysis, etc.
• $3 billion total cost
• 1 Gb/month at largest centers (2005)
• YCGA = 9.6 Tb per month (2011)
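To put these throughput figures in perspective, the arithmetic can be sketched in a few lines of Python. The 30x depth used at the end is an assumption chosen for illustration, not a figure from the lecture.

```python
# Unit sizes in base pairs.
GB = 10**9

HUMAN_GENOME_BP = 3 * GB      # human genome, ~3 Gb

center_2005 = 1 * GB          # 1 Gb/month at the largest centers (2005)
ycga_2011 = 9600 * GB         # 9.6 Tb/month at YCGA (2011)


def genomes_per_month(output_bp_per_month, depth=1):
    """Human-genome equivalents of raw sequence per month at a given depth."""
    return output_bp_per_month / (HUMAN_GENOME_BP * depth)


fold_increase = ycga_2011 / center_2005
print(fold_increase)                           # 9600.0
print(genomes_per_month(ycga_2011))            # 3200.0 genome equivalents at 1x
print(genomes_per_month(ycga_2011, depth=30))  # assumed 30x depth per genome
```

Raw output is not the same as finished genomes: assembly, QC and analysis are not counted here.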
Genome assembly and annotation
>>10^9 sequencing reads
36 bp - 1 kb
3 Gb
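Read count, read length and genome size together determine coverage, the average number of reads representing each position. A minimal sketch, assuming every read has the same length and all reads are usable; the specific read counts and lengths below are illustrative inputs, not values from the lecture.

```python
import math


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of reads covering each position (idealized)."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


GENOME_SIZE = 3 * 10**9   # human genome, ~3 Gb

# One billion 36 bp reads on a 3 Gb genome.
print(mean_coverage(10**9, 36, GENOME_SIZE))   # 12.0

# Reads of 100 bp needed for 20x coverage.
print(reads_needed(20, 100, GENOME_SIZE))      # 600000000
```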
Genome assembly
Assembly quality criteria:
• Accuracy: number of errors (Human: << 1/100,000 bp)
• Contiguity: number of gaps (Human: est. 357)
• Coverage: average number of reads representing a particular position in the assembly
  Human, Mouse, Rat: > 20x
  Chimpanzee: ~6x
  Squirrel: ~2x
Assembly steps:
1. Generate reads
2. Find overlapping reads
3. Assemble reads into contigs
4. Join contigs into scaffolds using mate pairs (e.g., Scaffold_0: 12,865,123 – 12,965,110)
5. Join scaffolds into “finished” sequence anchored on chromosomes
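The "find overlapping reads" and "assemble reads into contigs" steps can be illustrated with a toy example. This is a sketch of the idea only, assuming error-free reads and exact suffix/prefix matches; real assemblers handle sequencing errors, repeats and billions of reads. The three reads are cut by hand from a 42-base stretch of example sequence, and the function names are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads into one contig if they overlap, else return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


read1 = "AGTTGTATTATTAGAAACTG"
read2 = "TAGAAACTGAGGGCTAAAAA"
read3 = "GGCTAAAAACTGTGCACATA"

print(overlap_length(read1, read2))   # 9
contig = merge_reads(merge_reads(read1, read2), read3)
print(contig)  # AGTTGTATTATTAGAAACTGAGGGCTAAAAACTGTGCACATA
```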
Chr5: 133,876,119 – 134,876,119
[Slide graphic: wall of raw nucleotide sequence, AGTTGTATTATTAGAAACTGAGGGCTAAAAACTGTGCACATACACAG…]
Genome annotation
~3 billion bp
[Slide graphic: raw nucleotide sequence, ACAATAAATCACATTAATTCCTTATCTCATGTGAAATTTCATATTTATGAT…]
Genes:
- Coding, noncoding, miRNA, etc.
- Isoforms
- Expression
Genetic variation:
- SNPs and CNVs
- Sequence conservation
Regulatory sequences:
- Promoters
- Enhancers
- Insulators
Epigenetics:
- DNA methylation
- Chromatin
Density of biological information in the human genome
Chr5: 133,876,119 – 134,876,119
Genes
Transcription
Histone mods
TF binding
Mouse orthology
SNPs
Repeats
Annotation depth varies by species
Human, Mouse (Fly, Worm, Yeast):
- Chromosome assemblies
- Dense gene and regulatory maps, variation, etc.
Other models (Dog, Chicken, Zebrafish):
- Chromosome assemblies
- Partial gene maps; variation; little regulatory data
Low coverage vertebrate genomes:
- Scaffold assemblies
- Few annotated genes
- Used for comparative purposes
Accessing genomic data
Pre-packaged, public data
• Genome assemblies
• Gene models
• Expression, regulation, variation (ENCODE, Epigenome)
• Evolutionary conservation
• Someone else (usually a consortium) has generated the
data, done QC, and done rudimentary analysis
• Visualized via genome browsers, for the most part
Private data
• Generated by you or someone else
• Unannotated genome sequence
• RNA-seq, ChIP-seq, Exome seq datasets, etc.
• Raw data publicly available via download
Portals to access and interpret genomes
UCSC Genome Browser (genome.ucsc.edu):
Visualization, data recovery, simple analysis
(also genome-preview.ucsc.edu)
ENSEMBL (ensembl.org):
Visualization, data recovery, simple analysis
Integrative Genomics Viewer
(broadinstitute.org/software/igv/):
Local genome viewer (visualize local and remote data)
UCSC Genome Browser
genome.ucsc.edu
Wiki Page: genomewiki.ucsc.edu
Read the User Guide
Human genome main page
(Feb 2009 assembly)
There are multiple assemblies
for many genomes!
Different genome assemblies have different coordinate systems and may have different annotations:
chr2:236,438,403-236,438,948 in March 2006 (hg18) is chr2:236,773,664-236,774,209 in Feb 2009 (hg19)
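The example above can be checked numerically. A small sketch that parses the two browser positions and compares them; the function name is hypothetical. The shift it reports applies to this locus only and cannot be used as a genome-wide offset between hg18 and hg19.

```python
def parse_position(text):
    """Parse 'chr2:236,438,403-236,438,948' into (chrom, start, end), 1-based."""
    chrom, span = text.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chrom, start, end


hg18 = parse_position("chr2:236,438,403-236,438,948")
hg19 = parse_position("chr2:236,773,664-236,774,209")

# Browser positions are 1-based and include both ends, hence the +1.
length_hg18 = hg18[2] - hg18[1] + 1
length_hg19 = hg19[2] - hg19[1] + 1
shift = hg19[1] - hg18[1]

print(length_hg18, length_hg19)   # 546 546
print(shift)                      # 335261
```

The interval keeps its length, but its coordinates move by several hundred kilobases, which is why coordinates must always be reported together with the assembly name.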
Genome Viewer
Categories of data: displayed as tracks
Discrete intervals (genes) or continuous (transcription)
Category: Genes and Gene Prediction
Hyperlinks and tabs for individual tracks
• Go to track description page
• Hide or show data in genome viewer
Some tracks include multiple datasets (‘subtracks’)
• Go to track description page to select
Different assemblies have different
annotations!
Sample Genome Viewer image: PITX1
Base position
Gene model (discrete)
Transcription (continuous)
TF binding
SNPs
Repeats
Which gene annotation to use?
Gene description page and links to other resources
‘Layered’ tracks: Transcription
Display options
Subtracks
Integrating different types of annotation data
Proximal enhancer
Promoter
Downloading primary annotation data
Wiki Page: genomewiki.ucsc.edu
All available via FTP
Common Genome Browser file formats
BED format
• For interval data (e.g., exons)
• Tab-delimited format: chr start stop identifier
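• BED coordinates are ‘zero-based, half-open’: the start is counted from 0 and the stop position is not part of the interval (see http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms)
• Position coordinates on the browser are 1-based and include both ends, so the start differs by one between the two systems. This leads to confusion if you are not careful.
• Example: the BED interval chr16 80372593 80373755 is shown in the browser as chr16:80372594-80373755
• BEDTools: utilities for comparing genomic features; you will use them on your problem sets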
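The conversion between BED and browser coordinates is a one-line adjustment of the start position. A minimal sketch using the chr16 example from the slide; the function names are hypothetical.

```python
def bed_to_browser(chrom, start, end):
    """Zero-based, half-open BED interval -> 1-based browser position string."""
    return f"{chrom}:{start + 1}-{end}"


def browser_to_bed(position):
    """1-based browser position string -> zero-based, half-open BED tuple."""
    chrom, span = position.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chrom, start - 1, end


def bed_length(start, end):
    """Interval length in bp; in BED no +1 is needed."""
    return end - start


print(bed_to_browser("chr16", 80372593, 80373755))   # chr16:80372594-80373755
print(browser_to_bed("chr16:80372594-80373755"))     # ('chr16', 80372593, 80373755)
print(bed_length(80372593, 80373755))                # 1162
```

Only the start changes; the end is numerically the same in both systems.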
WIG format
• For continuous data (e.g., the Transcriptome track mentioned earlier)
• WIG files are very large! BigWig is an alternative format you will
learn about in discussion.
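To show how continuous data is laid out, here is a sketch that reads the fixedStep variant of WIG and converts each value into a BED-style interval. It assumes the standard fixedStep declaration line (chrom, start, step, optional span) with 1-based start positions; the signal values and coordinates in the example are made up for illustration.

```python
def fixed_step_to_intervals(wig_text):
    """Convert fixedStep WIG text to (chrom, start0, end, value) tuples
    in zero-based, half-open (BED-style) coordinates."""
    intervals = []
    chrom = position = step = span = None
    for line in wig_text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip header and comment lines
        if line.startswith("fixedStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            position = int(fields["start"])      # 1-based in WIG
            step = int(fields["step"])
            span = int(fields.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")
        intervals.append((chrom, position - 1, position - 1 + span, float(line)))
        position += step
    return intervals


# Hypothetical signal values, for illustration only.
example = """
fixedStep chrom=chr5 start=134000001 step=100 span=100
1.5
2.0
0.0
"""
for interval in fixed_step_to_intervals(example):
    print(interval)
```

One value per line with no coordinates repeated is what keeps WIG compact for dense data, but files covering a whole genome are still very large.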
The Table Browser
(under Tools)
Select datasets
Compare datasets
Download data
Integrating your own experimental data
Proximal enhancer
Promoter
Mapping binding sites for a transcription factor of interest
Custom tracks and sessions
• Display and share your own data on the browser
• Custom tracks can be intersected, etc. in the Table Browser
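Intersecting two tracks means finding the regions they share. The sketch below shows the underlying interval logic on BED-style (zero-based, half-open) coordinates. It is a simple quadratic comparison for illustration, not how the Table Browser or BEDTools are implemented, and the coordinates are hypothetical.

```python
def intersect(features_a, features_b):
    """Return the overlapping segments between two lists of
    (chrom, start, end) intervals in BED-style coordinates."""
    hits = []
    for chrom_a, start_a, end_a in features_a:
        for chrom_b, start_b, end_b in features_b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:   # half-open: touching intervals do not overlap
                hits.append((chrom_a, start, end))
    return hits


# Hypothetical custom track (e.g., ChIP-seq peaks) and annotation track.
peaks = [("chr5", 100, 200), ("chr5", 500, 600)]
enhancers = [("chr5", 150, 250), ("chr5", 600, 700)]

print(intersect(peaks, enhancers))   # [('chr5', 150, 200)]
```

The second peak ends at 600 and the second enhancer starts at 600, so in half-open coordinates they are adjacent and share no bases.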
Track Hubs
(under My Data)
Integrating Track Hub data with your own experimental data
Genome Browser utilities: BLAT
(under Tools)
• Rapidly find sequence locations in an assembly
• Works for DNA sequences longer than 24 bp that are at least 95% identical to the target genome
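The length and identity guideline can be made concrete with a small check. This is not how BLAT searches a genome; it only computes percent identity for two sequences that are already aligned without gaps. The function names are hypothetical, and the query is the example sequence with two bases changed by hand.

```python
def percent_identity(query, target):
    """Percent of matching positions between two equal-length, gapless sequences."""
    if len(query) != len(target):
        raise ValueError("sequences must have the same length")
    if not query:
        raise ValueError("sequences must not be empty")
    matches = sum(q == t for q, t in zip(query.upper(), target.upper()))
    return 100.0 * matches / len(query)


def meets_guideline(query, target, min_length=25, min_identity=95.0):
    """True if the query is longer than 24 bp and at least 95% identical."""
    return len(query) >= min_length and percent_identity(query, target) >= min_identity


target = "AGTTGTATTATTAGAAACTGAGGGCTAAAAACTGTGCACA"   # 40 bp
query = "C" + target[1:-1] + "T"                        # 2 mismatches

print(percent_identity(query, target))   # 95.0
print(meets_guideline(query, target))    # True
```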
Assembly quality and annotation vary across genomes
Assembly not anchored to chromosomes
Poor gene annotation
Assembly quality metrics
Whole-genome alignment to mouse
Genome Browser utilities: LiftOver
(under Tools)
• Convert coordinates from one assembly to another (e.g., hg18 to hg19)
• Identify orthologous positions between genomes (e.g., human to mouse)
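Conceptually, coordinate conversion works by looking up which aligned block a position falls in and applying that block's offset; positions outside any block cannot be converted. The toy below illustrates this idea only. It does not read real chain files, and the single block is a hypothetical one built from the hg18/hg19 chr2 example earlier in the lecture, assuming that interval aligns without gaps.

```python
# HYPOTHETICAL alignment block: (chrom, old_start, new_start, size), 1-based.
# Built from the chr2 hg18 -> hg19 example; assumed gapless for illustration.
BLOCKS = [("chr2", 236_438_403, 236_773_664, 546)]


def lift_position(chrom, position, blocks=BLOCKS):
    """Map a position to the new assembly, or return None if unmapped."""
    for block_chrom, old_start, new_start, size in blocks:
        if chrom == block_chrom and old_start <= position < old_start + size:
            return chrom, new_start + (position - old_start)
    return None


print(lift_position("chr2", 236_438_500))   # ('chr2', 236773761)
print(lift_position("chr2", 1_000))         # None: outside every block
```

For real work, use the LiftOver tool and check which features fail to convert rather than assuming every position has a counterpart.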
Ensembl portal
ensembl.org
Ensembl genome browser
ensembl.org
Accessing raw data: GEO
GEO: Functional genomics data (RNA-seq, ChIP-seq, etc.)
Variation data: dbSNP, dbGAP
Metadata:
Primary data files:
Wrap-up
Problem Set #1: Learn how to access and manipulate genomic datasets
Next lecture: High-throughput sequencing technologies