(Human) Genomics
BIOM/PHAR206 – 05/19/2014
Olivier Harismendy, PhD
Division of Genome Information Sciences
Department of Pediatrics
Moores UCSD Cancer Center

UCSC Genome Browser
• isPCR
• BLAT
• LiftOver
• Track types
  – BED minimum
  – BED extended
  – WIG
• Track Display and Shuffle
• Browser Navigation
• Custom Session
  – Export Figure
• Custom Tracks

0-based coordinates
Sequence   A | C | C | G | G | T | C | G | A
1 based    1   2   3   4   5   6   7   8   9
0 based      1   2   3   4   5   6   7   8   9
1-based numbers label the bases themselves. 0-based numbers label the boundaries between bases, so the first base spans 0 to 1.

Human Genome Assemblies

BED Track Formats
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255

BED Track Formats
Header: space separated parameters
• name=<track_label>
• description=<center_label>
• type=<track_type> - Defines the track type. The track type attribute is required for BAM, BED detail, bedGraph, bigBed, bigWig, broadPeak, narrowPeak, Microarray, VCF and WIG tracks.
• visibility=<display_mode> - 0 - hide, 1 - dense, 2 - full, 3 - pack, and 4 - squish.
• color=<RRR,GGG,BBB> - Defines the main color for the annotation track.
• itemRgb=On
• colorByStrand=<RRR,GGG,BBB RRR,GGG,BBB> - Sets colors for + and - strands, in that order.
• useScore=<use_score>
• group=<group>
• priority=<priority> - When the group attribute is set, defines the display position of the track relative to other tracks.
• db=<UCSC_assembly_name> - When set, indicates the specific genome assembly for which the annotation data is intended.
• offset=<offset> - Defines a number to be added to all coordinates in the annotation track. The default is "0".
• maxItems=<#> - Defines the maximum number of items the track can contain.
• url=<external_url> - Defines a URL for an external link associated with this track.
• htmlUrl=<external_url> - Defines a URL for an HTML description page to be displayed with this track.
• bigDataUrl=<external_url> - Defines a URL to the data file for BAM, bigBed, bigWig or VCF tracks.

BED Track Formats
• For intervals
• Header: space separated configuration parameters
  – chrom - The name of the chromosome.
  – chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  – chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
  – name - Defines the name of the BED line.
  – score - A score between 0 and 1000.
  – strand - Defines the strand, either '+' or '-'.
  – thickStart - The starting position at which the feature is drawn thickly.
  – thickEnd - The ending position at which the feature is drawn thickly.
  – itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
  – blockCount - The number of blocks (exons) in the BED line.
  – blockSizes - A comma-separated list of the block sizes.
  – blockStarts - A comma-separated list of block starts.
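The BED fields listed above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated columns and using only the first six fields; `BedInterval`, `parse_bed_line`, `feature_length` and `to_one_based` are illustrative names, not part of any UCSC tool. It also shows how a 0-based BED interval, whose chromEnd base is not included, maps to 1-based inclusive coordinates.

```python
from typing import NamedTuple, Tuple


class BedInterval(NamedTuple):
    """First six BED columns (hypothetical helper type, for illustration only)."""
    chrom: str
    chrom_start: int  # 0-based position of the first base of the feature
    chrom_end: int    # 0-based position just past the feature (not included)
    name: str
    score: int
    strand: str


def parse_bed_line(line: str) -> BedInterval:
    """Parse one BED data line; assumes at least six whitespace-separated fields."""
    fields = line.split()
    return BedInterval(fields[0], int(fields[1]), int(fields[2]),
                       fields[3], int(fields[4]), fields[5])


def feature_length(iv: BedInterval) -> int:
    """With an excluded end base, the length is simply end minus start."""
    return iv.chrom_end - iv.chrom_start


def to_one_based(iv: BedInterval) -> Tuple[int, int]:
    """Convert to 1-based inclusive coordinates: shift the start, keep the end."""
    return iv.chrom_start + 1, iv.chrom_end


# First data line of the ItemRGBDemo track.
record = parse_bed_line("chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0")
print(record.name, feature_length(record), to_one_based(record))
# prints: Pos1 1167 (127471197, 127472363)
```

Because the end is excluded, adjacent features such as Pos1 and Pos2 share a boundary value (127472363) without overlapping.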
WIG track format
# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
# 200 base wide points graph at every 300 bases, 50 pixel high graph
# autoScale off and viewing range set to [0:1000]
# priority = 20 positions this as the second graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
900
800
700
600
500
400
300
200
100

Specific Tracks of interest
• UCSC genes
• RefSeq Genes
• RepeatMasker
• Conservation
• TF motif predictions
• dbSNP
• ENCODE
• Roadmap

Custom Sessions
• Create an account
• Customize the tracks displayed
• Add your own track (limited in size and time)
• Save and Share

Table Browser
• Subset gene, region, genome
• Output BED or fasta
• Intersection
• Filters

ENCODE / Roadmap Tracks
• Track search
• Cell Types / Tissue Types
• Raw
• Peaks
• HMM

UNIX commands
• head
• more (press q to exit)
• cat
  – Example: cat file
  – Example: cat file1 file2
• grep
  – grep -v 'expression'
  – grep -A 1 'expression'
  – grep -B 2 'expression'
  – Example: grep -v '#' file.txt to remove comments
• Expression metacharacters
  – $ end of line
  – ^ beginning of line
  – [AB] A or B
  – . any single character; * zero or more repeats of the preceding character
  – Example: 'CDKN.*' or 'chr[1-7]'

UNIX commands
• cut
  – cut -f 1
  – cut -f 3 -d ':'
• sort
  – sort -n
  – sort -nr (or sort -n -r)
  – sort -k 2
• uniq
  – uniq
  – uniq -c
• wc
  – wc -l file.txt
  – Example: cut -f 1 file | sort | uniq -c

UNIX commands
• sed
  – sed 's/foo/bar/g' file : find and replace
• awk
  – awk '$3>2000' file : select rows with 3rd field > 2000
  – awk '{if ($3>2000) print $1,$2}' file : only print first 2 columns
  – awk '{sum+=$3} END {print sum}' file : print sum of column 3
  – awk '{sum+=$3} END {print sum/NR}' file : print average of column 3
• join
  – join -j 1 sorted_file1 sorted_file2

Demo #1 and #2

Human Genetic Variation
Highly Similar Genomes
DNA variants (Sequence differences)
Phenotypic Differences (Physical traits)

Variant Types
Frazer et al. 2009
Rahim, Harismendy et al (2008)

Variants from an individual genome
Within any given individual there are ~4 million genetic variants encompassing ~12 Mb

Variants from multiple genomes
Within a given individual the majority of variants are common.

Next Generation DNA analysis
• Whole genome sequencing
  – Mutations (coding and non-coding)
  – Translocations
  – Copy Number Variants
• Whole Exome Sequencing
  – Mutations (coding)
  – ~Copy number variants (trisomy, gene amplifications)
• Gene Panel
  – Mutations (coding)

Variant Frequencies
• Common genetic variants
  – second allele present at greater than 3% frequency
• Rare genetic variants
  – present at less than 3% frequency, and commonly at very low frequencies
• Private variants
  – in limited families or single individuals

HapMap Project
Genotyped ~3.1 million SNPs in 270 individuals
  – 90 Yoruba in Ibadan, Nigeria (YRI)
  – 90 European descent in Utah, USA (CEU)
  – 45 Han Chinese in Beijing, China (CHB)
  – 45 Japanese in Tokyo, Japan (JPT)

Map of Genetic Variation
Relationships between common SNPs in the human genome
Frazer et al (2007)

1000G Project

VCF format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,sp
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;D
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G

Linkage Disequilibrium (LD)
Given two biallelic sites, there are four combinations that can be observed, with the following distributions.
SNP 1 = A/G
SNP 2 = A/C

SNP1SNP2   Case r²=1   Case r²=0
AA         70          25
AC         0           25
GA         0           25
GC         30          25

LD measures the level of correlation between SNPs
LD is the consequence of recombination at preferential sites

LD Bin structure example
LD bin = group of SNPs with r² ≥ 0.8
• The majority of common SNPs are in LD bins in the human genome
• Genotypes of a set of ~500,000 “tag SNPs” provide information (r² ≥ 0.8) regarding a large fraction (90%) of all 8 million common SNPs present in humans.
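The two cases in the LD table can be checked numerically. The sketch below assumes the four counts are phased haplotype counts and uses the standard definition r² = D² / (p1(1-p1) p2(1-p2)), where D = p12 - p1·p2, p1 and p2 are the frequencies of one chosen allele at each SNP, and p12 is the frequency of the haplotype carrying both. The function name `r_squared` and the dictionary layout are illustrative choices, not taken from the slides.

```python
from typing import Dict


def r_squared(hap_counts: Dict[str, int], allele1: str = "A", allele2: str = "A") -> float:
    """r^2 between two biallelic SNPs from haplotype counts.

    Keys are two-letter haplotypes: first letter is the SNP 1 allele,
    second letter is the SNP 2 allele. Assumes both SNPs are polymorphic
    (otherwise the denominator is zero).
    """
    total = sum(hap_counts.values())
    # Frequency of the chosen allele at each SNP.
    p1 = sum(n for hap, n in hap_counts.items() if hap[0] == allele1) / total
    p2 = sum(n for hap, n in hap_counts.items() if hap[1] == allele2) / total
    # Frequency of the haplotype carrying both chosen alleles.
    p12 = hap_counts.get(allele1 + allele2, 0) / total
    d = p12 - p1 * p2  # deviation from independence
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))


case_perfect = {"AA": 70, "AC": 0, "GA": 0, "GC": 30}   # column "Case r2=1"
case_none = {"AA": 25, "AC": 25, "GA": 25, "GC": 25}    # column "Case r2=0"

print(round(r_squared(case_perfect), 3), round(r_squared(case_none), 3))
# prints: 1.0 0.0
```

In the first case, knowing the SNP 1 allele fully determines the SNP 2 allele, which is why one SNP of an LD bin can serve as a tag for the others.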
GWAS principle
From phenotype to genotype
Tests if common SNPs tagging an interval in the human genome are “associated” with a disease
http://www.mpg.de

GWAS results
Q1 2011: 221 traits, 1319 studies, >4000 associated SNPs
PR interval
WTCCC (2007)
The large number of tests requires a low p-value threshold (5×10⁻⁸)
Sample size determines the variant frequencies and effect sizes that can be detected (power)

GWAS highlights
• Many genes/loci not previously known to be involved in the diseases studied
• Newly identified pathways suggest that molecular subphenotypes of common diseases may exist
• Many common diseases have the same associated genes, suggesting similar etiologies

GWAS limitations
– Genetic
  • Small effect sizes: only explain a small fraction (1-25%) of the heritability
  • Missing heritability can be hiding in
    – Rare variants with large effects
    – Epistasis (Gene x Gene interactions)
    – Gene x Environment interaction (overlooked in heritability studies)
– Clinical
  • Limited prognostic value: classic markers (family history, lifestyle) work better
  • Limited by ethnicity
– Functional
  • Proxy SNPs are not the functional ones
  • Genes associated by proximity: Variants are mostly outside
  • Cell type and condition unknown

Demo #3

Cancer Types

Clinical Data Collected
age_at_initial_pathologic_diagnosis   100%
history_of_colon_polyps, icd_10, icd_o_3_histology, preoperative_pretreatment_cea_level, pretreatment_history, primary_lymph_node_presentation_assessment: 89%, 99%, 100%, 82%, 60%, 98% (pairing not recoverable from the slide layout)
ajcc_cancer_staging_handbook_edition   80%
icd_o_3_site   99%
primary_tumor_pathologic_spread   100%
anatomic_site_colorectal   88%
bcr_patient_uuid   100%
informed_consent_verified   100%
kras_gene_analysis_performed   89%
prior_diagnosis   100%
race   57%
braf_gene_analysis_performed   87%
kras_mutation_codon   4%
residual_tumor   82%
braf_gene_analysis_result   6%
kras_mutation_found   9%
synchronous_colon_cancer_present   87%
tissue_source_site   100%
tumor_stage   96%
tumor_tissue_site   100%
circumferential_resection_margin, colon_polyps_present, date_of_form_completion, loss_expression_of_mismatch_repair_proteins_by_ihc, lymph_node_examined_count, lymphatic_invasion: 74%, 42%, 98%, 100%, 87%, 10% (pairing not recoverable from the slide layout)
date_of_initial_pathologic_diagnosis   100%
lymphnode_pathologic_spread   100%
venous_invasion   83%
days_to_birth   100%
days_to_death   89%
microsatellite_instability   16%
non_nodal_tumor_deposits   43%
vital_status   100%
weight   51%
days_to_initial_pathologic_diagnosis   100%
number_of_abnormal_loci   12%
anatomic_organ_subdivision   2%
days_to_last_followup   96%
days_to_last_known_alive   61%
distant_metastasis_pathologic_spread   98%
number_of_lymphnodes_positive_by_he   94%
ethnicity   55%
number_of_lymphnodes_positive_by_ihc   9%
gender   100%
height   47%
patient_id   100%
perineural_invasion_present   33%
histological_type   99%
person_neoplasm_cancer_status   86%
number_of_first_degree_relatives_with_cancer_diagnosis   85%
loss_expression_of_mismatch_repair_proteins_by_ihc_result   12%
number_of_loci_tested   18%
[Legend: Personal and history, Histology, Clinical, Molecular. Labels: Patients, Decreasing intrinsic sensitivity]

Clinical Data Collected
[Figure: timeline of the intervals Dx to Tx, Tx, Tx to Recurrence, and Recurrence to last FU; x-axis: Days after Dx, 0 to 6000]

Molecular Data Collected
Molecule | Method | Measured entity | Data
RNA | microarrays | 15,000 transcripts | Expression levels
RNA | RNA-Seq | All known and novel transcripts | Expression levels, isoform quantification, editing, novel transcripts, fusion transcripts
DNA | microarrays | 100k to 1M SNP | Copy Number Aberrations, LoH, Polymorphisms
DNA | Sanger Sequencing | 30 M base pairs | Coding Mutations
DNA | whole exome sequencing | 50 M base pairs | Coding Mutations, Copy Number Aberrations
DNA | whole genome | 3 billion base pairs | Coding and Regulatory Mutations, Copy Number Aberrations, Rearrangements
DNA | Methylation Array | 450,000 CpG | Methylation levels
DNA | Methylation Array | 27,000 CpG | Methylation levels

Demo #4