Lecture - Pharmacology

(Human) Genomics
BIOM/PHAR206 – 05/19/2014
Olivier Harismendy, PhD
Division of Genome Information Sciences
Department of Pediatrics
Moores UCSD Cancer Center
UCSC Genome Browser
• isPCR
• BLAT
• LiftOver
• Track types
– BED minimum
– BED extended
– WIG
• Track Display and Shuffle
• Browser Navigation
• Custom Session – Export Figure
• Custom Tracks
0-based coordinates
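Sequence: A|C|C|G|G|T|C|G|A
1 based (numbers the bases): 1 2 3 4 5 6 7 8 9
0 based (numbers the boundaries between bases, starting at 0 before the first base): 0 1 2 3 4 5 6 7 8 9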
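The two conventions differ only in the start coordinate. As a minimal sketch, the helper below (`one_to_bed` is a hypothetical name, not a UCSC tool) converts a 1-based inclusive interval to the 0-based half-open form: subtract 1 from the start and leave the end unchanged. The chromosome name `seq` is a placeholder.

```shell
# Convert 1-based inclusive intervals (chrom start end) to
# 0-based, half-open intervals: subtract 1 from the start only.
one_to_bed() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $3 }'
}

# Bases 3 to 5 (C, G, G) of the example sequence A|C|C|G|G|T|C|G|A
printf 'seq\t3\t5\n' | one_to_bed
```

In the 0-based half-open form the interval length is simply end minus start (5 - 2 = 3 bases here).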
Human Genome Assemblies
BED Track Formats
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255
BED Track Formats
Header: space-separated parameters
• name=<track_label>
• description=<center_label>
• type=<track_type> - Defines the track type. The track type attribute is required for BAM, BED detail, bedGraph, bigBed, bigWig, broadPeak, narrowPeak, Microarray, VCF and WIG tracks.
• visibility=<display_mode> - 0 - hide, 1 - dense, 2 - full, 3 - pack, and 4 - squish.
• color=<RRR,GGG,BBB> - Defines the main color for the annotation track.
• itemRgb=On
• colorByStrand=<RRR,GGG,BBB RRR,GGG,BBB> - Sets colors for + and - strands, in that order.
• useScore=<use_score>
• group=<group>
• priority=<priority> - When the group attribute is set, defines the display position of the track relative to other tracks.
• db=<UCSC_assembly_name> - When set, indicates the specific genome assembly for which the annotation data is intended.
• offset=<offset> - Defines a number to be added to all coordinates in the annotation track. The default is "0".
• maxItems=<#> - Defines the maximum number of items the track can contain.
• url=<external_url> - Defines a URL for an external link associated with this track.
• htmlUrl=<external_url> - Defines a URL for an HTML description page to be displayed with this track.
• bigDataUrl=<external_url> - Defines a URL to the data file for BAM, bigBed, bigWig or VCF tracks.
BED Track Formats
• For intervals
• Header: space-separated configuration parameters
• Data lines: the first three fields are required, the other nine are optional
– chrom - The name of the chromosome
– chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
– chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
– name - Defines the name of the BED line.
– score - A score between 0 and 1000.
– strand - Defines the strand - either '+' or '-'.
– thickStart - The starting position at which the feature is drawn thickly
– thickEnd - The ending position at which the feature is drawn thickly
– itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
– blockCount - The number of blocks (exons) in the BED line.
– blockSizes - A comma-separated list of the block sizes.
– blockStarts - A comma-separated list of block starts.
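Because chromStart is 0-based and chromEnd is not included, the length of a feature is just chromEnd minus chromStart. A minimal sketch, assuming whitespace-separated BED lines with at least six fields (`bed_lengths` is a hypothetical helper name):

```shell
# Print name, strand and length (chromEnd - chromStart) for each BED feature,
# skipping "track" and "browser" header lines.
bed_lengths() {
    awk '$1 != "track" && $1 != "browser" { print $4, $6, $3 - $2 }' "$@"
}

printf '%s\n' \
  'track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"' \
  'chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0' \
  'chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255' |
  bed_lengths
```

Both features come out 1167 bases long, with no off-by-one correction needed.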
WIG track format
# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
# 200 base wide points graph at every 300 bases, 50 pixel high graph
# autoScale off and viewing range set to [0:1000]
# priority = 20 positions this as the second graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
900
800
700
600
500
400
300
200
100
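WIG positions are 1-based, while BED-style formats such as bedGraph are 0-based and half-open. The sketch below rewrites variableStep blocks as bedGraph-style lines (chrom, start, end, value). It is an illustration, not a replacement for the UCSC utilities: `wig_to_bedgraph` is a hypothetical helper, and fixedStep blocks are deliberately ignored.

```shell
# Turn variableStep WIG records into 0-based, half-open intervals.
wig_to_bedgraph() {
    awk '
        $1 == "variableStep" {
            span = 1                       # WIG default when no span is declared
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "chrom") chrom = kv[2]
                if (kv[1] == "span")  span  = kv[2]
            }
            in_block = 1
            next
        }
        $1 == "fixedStep" || $1 == "track" { in_block = 0; next }
        /^#/ { next }                      # comment lines
        in_block && NF == 2 { print chrom, $1 - 1, $1 - 1 + span, $2 }
    ' "$@"
}

printf '%s\n' \
  'variableStep chrom=chr19 span=150' \
  '49304701 10.0' \
  '49304901 12.5' |
  wig_to_bedgraph
```

The first record, which starts at 1-based position 49304701 with span 150, becomes the interval 49304700 to 49304850.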
Specific Tracks of interest
• UCSC genes
• RefSeq Genes
• RepeatMasker
• Conservation
• TF motif predictions
• dbSNP
• ENCODE
• Roadmap
Custom Sessions
• Create an account
• Customize the tracks displayed
• Add your own track (limited in size and time)
• Save and Share
Table Browser
• Subset gene, region, genome
• Output BED or fasta
• Intersection
• Filters
ENCODE / Roadmap Tracks
• Track search
• Cell Types / Tissue Types
• Raw
• Peaks
• HMM
UNIX commands
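• head
• more (press q to exit)
• cat
– Example: cat file
– Example: cat file1 file2
• grep
– grep -v 'expression': print the lines that do not match
– grep -A 1 'expression': also print 1 line after each match
– grep -B 2 'expression': also print 2 lines before each match
– Example: grep -v '^#' file.txt to remove comment lines
• Expression metacharacters
– $ end of line
– ^ beginning of line
– [AB] A or B
– . any single character
– * zero or more repeats of the preceding character (.* matches any run of characters)
– Example: 'CDKN.*' or 'chr[1-7]'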
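A short sketch of these options on a toy table. The file contents are sample data made up for illustration, and the variable names are arbitrary. Note how 'chr[1-7]' without an anchor also matches chr12 and chr17.

```shell
# Toy gene table written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '#gene chrom' 'CDKN2A chr9' 'CDKN1B chr12' 'TP53 chr17' 'BRAF chr7' > "$tmp"

no_comments=$(grep -v '^#' "$tmp")       # every line except the comment line
cdkn_count=$(grep -c '^CDKN' "$tmp")     # number of lines starting with CDKN
loose=$(grep -c 'chr[1-7]' "$tmp")       # also counts chr12 and chr17
strict=$(grep 'chr[1-7]$' "$tmp")        # $ anchors the digit to the end of the line
after=$(grep -A 1 'CDKN1B' "$tmp")       # the match plus the line after it

echo "$cdkn_count CDKN genes; unanchored chr[1-7] matches $loose lines"
echo "$strict"
rm -f "$tmp"
```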
UNIX commands
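• cut
– cut -f 1 (keep field 1; fields are tab-separated by default)
– cut -f 3 -d ':' (keep field 3, using ':' as the delimiter)
• sort
– sort -n (numeric order)
– sort -nr (or sort -n -r) (reverse numeric order)
– sort -k 2 (sort on the key starting at field 2)
• uniq
– uniq (collapse adjacent duplicate lines)
– uniq -c (same, with a count in front of each line)
• wc
– wc -l file.txt (count lines)
– Example: cut -f 1 file | sort | uniq -c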
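The cut | sort | uniq -c pipeline counts how often each value occurs in a column. A minimal sketch on made-up interval data; the wrapper name `count_by_chrom`, the extra sort by count, and the final awk step (which only tidies the padding that uniq -c adds) are additions for illustration.

```shell
# Count lines per value of column 1, most frequent first.
count_by_chrom() {
    # sort first: uniq only collapses duplicates that are adjacent
    cut -f 1 "$@" | sort | uniq -c | sort -k1,1nr -k2,2 | awk '{ print $2, $1 }'
}

printf 'chr7\t100\t200\nchr1\t5\t50\nchr7\t300\t400\nchr7\t500\t600\nchr1\t60\t90\nchr2\t1\t9\n' |
  count_by_chrom
```

On this input the result lists chr7 with 3 lines, chr1 with 2 and chr2 with 1.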
UNIX commands
• sed
– sed 's/foo/bar/g' file: find and replace
• awk
– awk '$3>2000' file: select rows with 3rd field > 2000
– awk '{if ($3>2000) print $1,$2}' file: only print first 2 columns
– awk '{sum+=$3} END {print sum}' file: print sum of column 3
– awk '{sum+=$3} END {print sum/NR}' file: print average of column 3
• join
– join -j 1 sorted_file1 sorted_file2
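The sed, awk and join one-liners can be tried on two small files. This is a sketch on made-up data: the file names, gene labels and values are placeholders. Both files are already sorted on column 1, which join requires.

```shell
tmpdir=$(mktemp -d)
printf 'geneA\t10\t1500\ngeneB\t20\t2500\ngeneC\t30\t3500\n' > "$tmpdir/lengths.txt"
printf 'geneA\tchr1\ngeneC\tchr7\n' > "$tmpdir/chroms.txt"

first=$(sed 's/gene/GENE_/g' "$tmpdir/lengths.txt" | head -n 1 | cut -f 1)  # find and replace
big=$(awk '$3 > 2000 { print $1 }' "$tmpdir/lengths.txt")                  # rows with 3rd field > 2000
total=$(awk '{ sum += $3 } END { print sum }' "$tmpdir/lengths.txt")       # sum of column 3
mean=$(awk '{ sum += $3 } END { if (NR > 0) print sum / NR }' "$tmpdir/lengths.txt")  # average
joined=$(join -j 1 "$tmpdir/lengths.txt" "$tmpdir/chroms.txt")             # rows sharing column 1

echo "sum=$total mean=$mean"
echo "$joined"
rm -rf "$tmpdir"
```

join keeps only the keys present in both files, so geneB, which has no chromosome entry, is dropped from the joined output.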
Demo #1 and #2
Human Genetic Variation
Highly Similar Genomes
DNA variants (Sequence differences)
Phenotypic Differences (Physical traits)
Variant Types
Frazer et al. 2009
Rahim, Harismendy et al (2008)
Variants from an individual genome
Within any given individual there are ~4 million genetic variants encompassing ~12 Mb
Variants from multiple genomes
Within a given individual the majority of variants are common.
Next Generation DNA analysis
• Whole genome sequencing
– Mutations (coding and non-coding)
– Translocations
– Copy Number Variants
• Whole Exome Sequencing
– Mutations (coding)
– ~Copy number variants (trisomy, gene amplifications)
• Gene Panel
– Mutations (coding)
Variant Frequencies
• Common genetic variants
– second allele present at greater than 3% frequency
• Rare genetic variants
– present at less than 3% frequency, and commonly at very low frequencies
• Private variants
– in limited families or single individuals
HapMap Project
Genotyped ~3.1 million SNPs in 270 individuals
– 90 Yoruba in Ibadan, Nigeria (YRI)
– 90 European descent in Utah, USA (CEU)
– 45 Han Chinese in Beijing, China (CHB)
– 45 Japanese in Tokyo, Japan (JPT)
Map of Genetic Variation
Relationships between common SNPs in the human genome
Frazer et al (2007)
1000G Project
VCF format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,sp
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO
20     14370   rs6054257 G   A      29   PASS   NS=3;DP=14;AF=0.5;DB;H2
20     17330   .         T   A      3    q10    NS=3;DP=11;AF=0.017
20     1110696 rs6040355 A   G,T    67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;D
20     1230237 .         T   .      47   PASS   NS=3;DP=13;AA=T
20     1234567 microsat1 GTC G,GTCT 50   PASS   NS=3;DP=9;AA=G
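The fixed VCF columns make simple filters easy to write with awk. The sketch below keeps records whose FILTER column is PASS and pulls the AF value out of the INFO column. `vcf_pass_af` is a hypothetical helper, the "NA" placeholder for records without an AF key is a choice made here, and dedicated tools are a better fit for real analyses.

```shell
# Print CHROM, POS, REF, ALT and allele frequency for PASS records.
vcf_pass_af() {
    awk '
        /^#/ { next }                      # skip meta-information and column header lines
        $7 == "PASS" {
            af = "NA"                      # NA when the INFO column has no AF key
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^AF=/) af = substr(info[i], 4)
            print $1, $2, $4, $5, af
        }
    ' "$@"
}

printf '%s\n' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  '20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2' \
  '20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017' \
  '20 1230237 . T . 47 PASS NS=3;DP=13;AA=T' |
  vcf_pass_af
```

The q10 record is dropped, and the record at position 1230237 is reported with NA because its INFO column carries no AF.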
Linkage Disequilibrium (LD)
Given two biallelic sites, four allele combinations can be observed, with the following distributions.
SNP 1 = A/G
SNP 2 = A/C
SNP1-SNP2   Case r²=1   Case r²=0
AA          70          25
AC          0           25
GA          0           25
GC          30          25
LD measures the level of correlation between SNPs
LD is the consequence of recombination occurring at preferential sites: SNPs located between two such sites tend to be inherited together
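The two cases in the table can be checked with the standard definition r² = D² / (pA(1-pA) pB(1-pB)), where D = pAB - pA·pB, pA and pB are the frequencies of the first allele at each SNP, and pAB is the frequency of the haplotype carrying both. A minimal sketch (`ld_r2` is a hypothetical helper that takes the four haplotype counts in table order):

```shell
# Usage: ld_r2 n_AA n_AC n_GA n_GC   (haplotype counts, SNP1 allele then SNP2 allele)
ld_r2() {
    awk -v n11="$1" -v n12="$2" -v n21="$3" -v n22="$4" 'BEGIN {
        n   = n11 + n12 + n21 + n22
        pA  = (n11 + n12) / n            # frequency of the first allele at SNP 1
        pB  = (n11 + n21) / n            # frequency of the first allele at SNP 2
        D   = n11 / n - pA * pB          # deviation from independence
        den = pA * (1 - pA) * pB * (1 - pB)
        if (den == 0) { print "NA"; exit }   # a monomorphic site has no defined r2
        printf "%.2f\n", D * D / den
    }'
}

ld_r2 70 0 0 30      # first column of the table
ld_r2 25 25 25 25    # second column of the table
```

The first call gives 1.00 (the allele at one SNP fully predicts the other) and the second gives 0.00 (the two SNPs are independent).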
LD Bin structure example
LD bin = groups of SNPs with r² ≥ 0.8
• The majority of common SNPs are in LD bins in the human genome
• Genotypes of a set of ~500,000 “tag SNPs” provide information (r² ≥ 0.8) regarding a large fraction (90%) of all 8 million common SNPs present in humans.
GWAS principle
From phenotype to genotype
Tests if common SNPs tagging an interval in the human genome are “associated” with a disease
http://www.mpg.de
GWAS results
Q1 2011
221 traits
1319 studies
>4000 associated SNPs
PR interval
WTCCC (2007)
Large number of tests requires a low p-value threshold (5 × 10⁻⁸)
Sample size determines which variant frequencies and effect sizes can be detected (power)
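One way to see where a threshold of this size comes from is a Bonferroni correction: divide the overall significance level by the number of tests. The sketch below assumes a 0.05 significance level and roughly one million independent tests. Both numbers are assumptions for illustration, not values stated on the slide, and `bonferroni` is a hypothetical helper.

```shell
# Usage: bonferroni alpha number_of_tests
bonferroni() {
    awk -v alpha="$1" -v m="$2" 'BEGIN { printf "%.1e\n", alpha / m }'
}

bonferroni 0.05 1000000
```

With these inputs the per-test threshold is 5.0e-08.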
GWAS highlights
• Many genes/loci not previously known to be involved in the diseases studied
• Newly identified pathways suggest that molecular subphenotypes of common diseases may exist
• Many common diseases have the same associated genes, suggesting similar etiologies
GWAS limitations
– Genetic
• Small effect sizes: associated SNPs explain only a small fraction (1-25%) of the heritability
• Missing heritability can be hiding in
– Rare variants with large effects
– Epistasis (Gene x Gene interactions)
– Gene x Environment interactions (overlooked in heritability studies)
– Clinical
• Limited prognostic value: classic markers (family history, lifestyle) work better
• Limited by ethnicity
– Functional
• Proxy SNPs are not the functional ones
• Genes associated by proximity: variants are mostly outside genes
• Cell type and condition unknown
Demo #3
Cancer Types
Clinical Data Collected
age_at_initial_pathologic_diagnosis
100%
history_of_colon_polyps
icd_10
icd_o_3_histology
preoperative_pretreatment_cea_level
89%
pretreatment_history
primary_lymph_node_presentation_assessment
99%
100%
82%
60%
98%
ajcc_cancer_staging_handbook_edition
80%
icd_o_3_site
99%
primary_tumor_pathologic_spread
100%
anatomic_site_colorectal
bcr_patient_uuid
88%
100%
informed_consent_verified
kras_gene_analysis_performed
100%
89%
prior_diagnosis
race
100%
57%
braf_gene_analysis_performed
87%
kras_mutation_codon
4%
residual_tumor
82%
braf_gene_analysis_result
6%
kras_mutation_found
9%
synchronous_colon_cancer_present
87%
tissue_source_site
100%
tumor_stage
tumor_tissue_site
96%
100%
circumferential_resection_margin
colon_polyps_present
date_of_form_completion
loss_expression_of_mismatch_repair_proteins_by_ihc
74%
42%
98%
lymph_node_examined_count
100%
87%
lymphatic_invasion
10%
date_of_initial_pathologic_diagnosis
100%
lymphnode_pathologic_spread
100%
venous_invasion
83%
days_to_birth
days_to_death
100%
89%
microsatellite_instability
non_nodal_tumor_deposits
16%
43%
vital_status
weight
100%
51%
days_to_initial_pathologic_diagnosis
100%
number_of_abnormal_loci
12%
anatomic_organ_subdivision
2%
days_to_last_followup
96%
days_to_last_known_alive
61%
distant_metastasis_pathologic_spread
98%
number_of_lymphnodes_positive_by_he
94%
ethnicity
55%
number_of_lymphnodes_positive_by_ihc
9%
gender
height
100%
47%
patient_id
perineural_invasion_present
100%
33%
histological_type
99%
person_neoplasm_cancer_status
86%
number_of_first_degree_relatives_with_cancer_diagnosis
loss_expression_of_mismatch_repair_proteins_by_ihc_result
85%
12%
number_of_loci_tested
18%
Personal and history
Histology
Clinical
Molecular
Patients
Decreasing Intrinsic sensitivity
Clinical Data Collected
[Figure: timeline with segments Dx to Tx, Tx, Tx to Recurrence, Recurrence to last FU; x-axis: Days after Dx (0 to 6000)]
Molecular Data Collected
Molecule | Method | Measured entity | Data
RNA | microarrays | 15,000 transcripts | Expression levels
RNA | RNA-Seq | All known and novel transcripts | Expression levels, isoform quantification, editing, novel transcripts, fusion transcripts
DNA | microarrays | 100k to 1M SNP | Copy Number Aberrations, LoH, Polymorphisms
DNA | Sanger Sequencing | 30 M base pairs | Coding Mutations
DNA | whole exome sequencing | 50 M base pairs | Coding Mutations, Copy Number Aberrations
DNA | whole genome | 3 billion base pairs | Coding and Regulatory Mutations, Copy Number Aberrations, Rearrangements
DNA | Methylation Array | 450,000 CpG | Methylation levels
DNA | Methylation Array | 27,000 CpG | Methylation levels
Demo #4