Institute for Computational Biomedicine

Basics of high-throughput sequencing
Olivier Elemento, PhD
TA: Jenny Giannopoulou, PhD
CSHL High Throughput Data Analysis Workshop, June 2012

Plan
1. What high-throughput sequencing is used for
2. Illumina technology
3. Primary data analysis (alignment, QC)
4. Read formats
5. Secondary Analysis (mutation calling, transcript level quantification, etc)
6. Read data visualization
7. Useful R/BioC packages
8. Challenges and evolution of sequencing and its analysis

1. What high-throughput sequencing is used for

Full genome sequencing
Targeted sequencing
Exome sequencing

DNA methylation profiling
• Bisulfite treatment: mC → C (methylated cytosine is unchanged), C → U (unmethylated cytosine is converted to uracil)
• After PCR: C → C, U → T

RNA-seq

ChIP-seq
[Figure: DNA, transcription factor of interest, antibody]

High-throughput mapping of chromatin interactions (HiC)
Elemento lab (more on this next week)

And many others
• Gene fusion detection
• Translational profiling (which mRNAs localize to ribosomes)
• Small/miRNA sequencing
• Bacterial communities
• Protein-RNA interactions (PAR-CLIP, HITS-CLIP)
• …

2. Illumina technology

Illumina SBS Technology
Reversible Terminator Chemistry Foundation
[Figure: workflow from DNA (0.1-1.0 ug) through sample preparation, single molecule array, cluster growth, sequencing (cycles 1 2 3 4 5 6 7 8 9), image acquisition and base calling (T G C T A C G A T …)]
http://seqanswers.com/forums/showthread.php?t=21
http://www.illumina.com/technology/sequencing_technology.ilmn
© Illumina, Inc.
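As a side illustration of the bisulfite treatment slide in section 1 (methylated C stays C; unmethylated C becomes U and is read as T after PCR), here is a minimal Python sketch. The function names `bisulfite_convert` and `methylation_calls`, and the toy sequence, are hypothetical and only for illustration; the sketch handles a single strand and ignores sequencing errors and incomplete conversion, which real methylation callers must deal with.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Simulate bisulfite treatment + PCR on one strand.

    Unmethylated C is read as T; methylated C (0-based positions given
    in methylated_positions) is protected and stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")  # C -> U, read as T after PCR
        else:
            converted.append(base)
    return "".join(converted)


def methylation_calls(reference, converted_read):
    """For every C in the reference, report True if the read still shows C
    (methylated) and False if it shows T (unmethylated)."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference.upper(), converted_read.upper())):
        if ref == "C":
            if obs == "C":
                calls[i] = True
            elif obs == "T":
                calls[i] = False
            # any other base is left uncalled (e.g. sequencing error or SNV)
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"  # toy sequence, assumption for illustration
    read = bisulfite_convert(reference, methylated_positions=[1])
    print(read)
    print(methylation_calls(reference, read))
```

Comparing the converted read back to the unconverted reference is what lets each cytosine be called methylated or not; this is also why bisulfite reads need a conversion-aware aligner rather than a plain one.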
Single-end vs. paired-end sequencing

What comes out of the machine: short reads in fastq format
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
…

QS to int
In R: as.integer(charToRaw('e')) - 64
(The reads shown here use the older Illumina Phred+64 quality encoding, so the offset is 64. Sanger and Illumina 1.8+ files use Phred+33; subtract 33 for those.)

Paired-end sequencing

s_8_1_sequence.txt.gz
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
…

s_8_2_sequence.txt.gz
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
GGCATATTTAACAGCATTGAACAGAATTCTGTGTCCTGTAAAAAAATTAGCTTA
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
a__aaa`ce`cgcffdf_acda^ea]befffbeged`g[a`e_caaac]cb`gb
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
TTGAGGCTGTTGTCATACTTCTCATGGTTCACACCCATGACGAACATGGGGGCG
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
a__eeeeeggegefhhhiiihhhhhiieghhhghhiiffhiififhhiihegic
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
CGGGGTGCACCTCGTCGTAGAGGAACTCTGCCGTCAGCTCTGCCCCATCGCCAA
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
^__ee__cge`cghghhfgddgfgi]ehhfffff^ec[beegidffhhfhadba
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
CTTAGTCTCAGTTTTCCTCCAGCAGCCTGAGGAAACTCAAAGGCACAGTTCCCA
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
_abeaaacg^g^eghhhhgafghhdfghfedeghfiiicfbgdHYagfeecggf
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
TAGGCTCAAAGTCTAACGCCAATCCCGAACCTGGGCATCTGTACACACACACAC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
abbeceeegggcghiihiihhhhiifhiiiiihiiiiiiihegh`eggfebfhg
…

Illumina sequencing using HiSeq2000
• Previously: GAIIx: ~30M reads per lane, 8 lanes (1 QC)
• Now: HiSeq2000 + TruSeq v3: 200M reads per lane, 8-16 lanes (1-2 QC) in parallel with HiSeq2000
• Multiplexing: attach barcode, mix samples, sequence, identify and remove barcode

Full Genome Sequencing using Illumina technology
• ~$4-6K reagent with Illumina (storage + analysis costs not included)
• Exercise: you want to sequence 1 human genome at 100X coverage; how many lanes?

QC for Illumina (part 1)
[Figure: template strand with 3'/5' ends and bases being sequenced]

3.
Primary data analysis (alignment, QC)

Read alignment programs
• BWA (Burrows-Wheeler Aligner)
  – http://bio-bwa.sourceforge.net/
  – Fast, accurate, can find (short) indels
  – Allow 1-3 mismatches by default
  – Can also align longer 454 reads
• Bowtie
  – http://bowtie-bio.sourceforge.net/index.shtml
  – Ultrafast, accurate, newest version finds indels too
  – Allow 1-3 mismatches by default
  – Integrated into TopHat (splice aligner)
• Others: Eland, Maq, SOAP, etc

BWA tutorial (for aligning single-end reads to genome)
• Get genome, e.g., from UCSC
  – http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
• Combine into 1 file
  – tar zvfx chromFa.tar.gz
  – cat *.fa > wg.fa
• Indexing the genome
  – bwa index -p hg19bwaidx -a bwtsw wg.fa
• Align
  – bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
• Convert to SAM format
  – bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam

Aligning paired-end reads
• Align the two files separately (each to its own output file)
  – bwa aln -t 4 hg19bwaidx s_3_1_sequence.txt.gz > s_3_1_sequence.txt.bwa
  – bwa aln -t 4 hg19bwaidx s_3_2_sequence.txt.gz > s_3_2_sequence.txt.bwa
• Convert to SAM format
  – bwa sampe hg19bwaidx s_3_1_sequence.txt.bwa s_3_2_sequence.txt.bwa s_3_1_sequence.txt.gz s_3_2_sequence.txt.gz > s_3_sequence.txt.sam

TopHat (spliced alignment)
• Download genome index
  – ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg18.ebwt.zip
• Align (here with distance between mates D ~ 100 bp)
  – tophat -r 100 -p 4 -o outdir/ hg18 s_1_1_sequence.txt s_1_2_sequence.txt
Trapnell et al, 2009

Basic QC
• Fraction of mapped reads
• How many unique mappers?
• Fraction of clonal reads (PCR duplicates)

4.
Read formats

• SAM/BAM
• Eland/Eland Export

SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeeceeeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eeeeeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbeeffdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr1 1249826 25 51M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV\`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr1 1246336 25 1M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr1 1246336 37 1M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdfggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51

CIGAR string, e.g. 5M3487N46M = 5-bp aligned block, 3487-bp skipped region on the reference (e.g. an intron), 46-bp aligned block
MD tag, e.g. MD:Z:4T46 = 4 matches, 1 mismatch (reference base is T), 46 matches
XT tag, e.g.
XT:A:U = unique mapper; XT:A:R = more than one high-scoring match

Paired-end SAM
D3B4KKQ1_0161:8:2206:11080:31374#CTTGTA 83 chr1 4481348 255 51M = 4481165 0 TTAGATGCATTTTCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAG hiiiiiiihihhdhghggdiiihihffihhheihihhhgggggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2206:8294:192062#CTTGTA 147 chr1 4481355 255 51M = 4481284 0 CATTTTCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACAC efehffhgfdiihhhhhihghiiihfhihdhiihgghigefggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2204:6985:145082#CTTGTA 147 chr1 4481360 255 51M = 4481202 0 TCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTA ghfhgihihghgihgiiiifiiiiihhhhfifhihhiigggeeceeeea__ NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2205:15014:60805#CTTGTA 83 chr1 4481360 255 51M = 4481238 0 TCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTA hihheiihiiiiiiiiiiiiiiiiiifhiefhiiiiiigggggeceeebba NM:i:0 NH:i:1
D3B4KKQ1_0161:8:1105:17802:25847#CTTGTA 83 chr1 4481362 255 51M = 4481198 0 TTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAAT gheiiiihhhiiiiiiiiiihiiiiiihgfiiiiiiiigeggceeeeebb_ NM:i:0 NH:i:1
D3B4KKQ1_0161:8:1208:2232:73719#CTTGTA 147 chr1 4481366 255 51M = 4481277 0 CATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAATTGTA fhghiiiiiiiiiiiiiiiiiiihghiihiiiiihgggegfggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2104:18142:93861#CTTGTA 83 chr1 4481367 255 51M = 4481198 0 ATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAATTGTAT ihghiiiheiiiiihhihfhifgghhhhfgfhiggge_ggggeeeeee_bb NM:i:0 NH:i:1

NM = edit distance
NH = number of alignments for that read

BAM format
• Compressed, indexable version of SAM
• Can be uploaded to UCSC Genome Browser

SAMtools
• http://samtools.sourceforge.net/
• Convert SAM to BAM
  – samtools view -bS file.sam > file.bam
• Sort BAM file
  – samtools sort file.bam file.sorted # (will create file.sorted.bam)
• Index BAM file
  – samtools index file.sorted.bam
• Convert BAM to SAM
  – samtools view file.bam > file.sam
• Rsamtools
  – http://www.bioconductor.org/packages/2.6/bioc/html/Rsamtools.html

SAMtools
• Get alignment statistics –
samtools flagstat pairendfile.bam
149923886 in total
0 QC failure
0 duplicates
124520915 mapped (83.06%)
149923886 paired in sequencing
74961943 read1
74961943 read2
120504218 properly paired (80.38%)
121586068 with itself and mate mapped
2934847 singletons (1.96%)
482748 with mate mapped to a different chr
143256 with mate mapped to a different chr (mapQ>=5)

SAMtools
• Get pileup
  – samtools pileup file.sorted.bam

Columns: chromosome, position, reference base, depth, read bases, base qualities
chr1 1156 T 26 tTttTTTtTttTttTtTtTTGTTTTT ggggeggggg^Vgf_fggggJceb_g
chr1 1157 T 26 tTttTTTtTttTttTtTtTTTTTTTT ggggfggggg[RgfNfgfgg`ed^]f
chr1 1158 G 26 g$GggGGGgGggGggGgGgGGGGGGGG gggg_ggggg[Ugfddgggga_eW\c
chr1 1159 A 25 AaaAAAaAaaAaaAaAaAAAAAAAA gggaefggg_Xgf_fggggadd]Zg
chr1 1160 A 25 AaaAAAaAaaAaaAaAaAAAAAAAA ggefggggdNVgbZbgggg`ee[\g
chr1 1161 C 25 C$c$c$CCCcCccCccCcCcCCCCCCCC gfgfggfggYYgeadgggg`ea^\g
chr1 1162 C 23 C$CCcCccCccCcCcCCCCCCCC^FC fgggge_`gf_dgggge_e]_gg
chr1 1163 T 22 T$T$tTttTttTtTtTTTTTTTTT ggffg\Rgf_dggeggde]_cg
chr1 1164 C 20 cCccCccCcCcCCCCCCCCC ggg`[gf_dggggg\d[]fg
chr1 1165 A 22 a$AaaAaaAaAaAAAAAAAAA^FA^FA ged_]ggadffgggecX^ggfg
chr1 1166 G 21 G$g$g$GggGgGgGGGGGGGGGGG ggc`gfWfggfggcaSdggfe
chr1 1167 C 19 CccCcCcCCCCCCCCCCC^FC agg\dgggggbZUdfgfgg
chr1 1168 T 19 TttTtTtTTTTTTTTTTTT eggcbfgfgg_cXdegfgg
chr1 1169 T 19 TttTtTtTTTTTTTTTTTT aggccggdggccZdggfgf
chr1 1170 T 19 TttTtTtTTTTTTTTTTTT `gfcfgggggccUcggcgg
chr1 1171 A 19 AaaAaAaAAAAAAAAAAAA ege_fgggggcc[aggcgg
chr1 1172 A 19 A$aaAaAaAAAAAAAAAAAA XggLfggfggdeM_ggagg
chr1 1173 G 18 g$gGgGgGGGGGGGGGGGG gf\fgggggcfPcggegg
chr1 1174 A 17 a$AaAaAAAAAAAAAAAA fce[gggg_eL]ggfdf
chr1 1175 A 16 A$aAaAAAAAAAAAAAA dfggfggdfS[ggegg

^ = start of read at that position
$ = end of read at that position

SAMtools
• Removing clonal reads
  – Multiple reads that map to the same position, with the same orientation, are usually considered PCR duplicates
  – For mutation detection (less important for RNA-seq), need to collapse them into 1 read (e.g.
read with highest quality score)
  – samtools rmdup -s file.bam file_noclonal.bam

5. Secondary Analysis (transcript level quantification, mutation calling)

RPKM
Reads per kilobase of transcript per million reads
• R: Count how many reads map to a transcript
• K: Divide by (length of transcript / 1,000)
• M: Divide by (total number of mapped reads in sample / 1,000,000)
Cufflinks uses FPKM (same as RPKM, F = fragment, for paired-end reads)

Cufflinks
cufflinks -p 4 -o outdir/ s_1_sequence.txt.sorted.bam
Trapnell et al, 2010
Other tools:
• MISO http://genes.mit.edu/burgelab/miso/
• Scripture http://www.broadinstitute.org/software/scripture/

Detecting Single Nucleotide Variations (SNVs)

Short read with no mismatch:
           AAAATACGCGTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)

Short read with 1 mismatch:
           AAAATACGCCTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)

Short read with 2 mismatches:
           AAAATACGCCTATTCTCCCATAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)

Sequencing has a high error rate
Mismatch = real variation OR sequencing error
Typical mismatch rate of entire datasets = 0.5-2% (errors >> real variations)

Single Nucleotide Variation chr2, pos=85623221 bp
Single Nucleotide Variation chr14, pos=35859525 bp
Single Nucleotide Variation chr1, pos=220952447

Cancer mutations
• All cells in tumor have heterozygous mutation
• A fraction of cells have heterozygous mutation
• Loss of heterozygosity due to loss of genetic material

The error/mismatch rate is not uniform across read length
[Figure: mismatch rate along the read]

Popular SNV calling programs
• GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• VarScan http://varscan.sourceforge.net/

SNVseeqer: Single Nucleotide Variation detection from deep sequencing data
N reads
at considered position; k reads with mutation
[Figure: reads stacked over the genome at the considered position, each read i with its own error rate p_i]

Is k greater than expected by chance, given error rates p_i?

\[ S_Z = Z_1 + \dots + Z_N \]
\[ P(S_Z = k) = \left\{ \prod_{i=1}^{N} (1 - p_i) \right\} \sum_{i_1 < \dots < i_k} w_{i_1} \cdots w_{i_k} \]

Here Z_i indicates a mismatch in read i (probability p_i) and w_i = p_i / (1 - p_i), the standard weights in this form of the distribution.

The Poisson-Binomial distribution
Wacker et al, 2012; Jiang et al, 2012
Chen & Liu, 1997

Indel calling
• Complicated because indels often occur within microsatellite regions, e.g. CACACACA
  – CA--CACACA as good as CACA--CACA, CACACA--CA
• Since reads are aligned independently, local realignment is needed
• DINDEL (used in 1000 Genomes Project) http://www.sanger.ac.uk/resources/software/dindel/

Variant annotation
• Variants can be either mutations or (more often) polymorphisms. dbSNP catalogs known polymorphisms
• Missense, nonsense, intron, 3’UTR, 5’UTR, etc
  – SeattleSNP http://pga.gs.washington.edu/
• Severity of missense mutations
  – PolyPhen http://genetics.bwh.harvard.edu/pph2/
  – Mutation Assessor http://mutationassessor.org/
• GATK for variant annotation http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• Cross-species conservation

6. Read data visualization

SAMtools
samtools tview file.sorted.bam wg.fa

UCSC Genome Browser
• Upload BAM file to genome browser or make it accessible to UCSC from your own web page

Integrative Genomics Viewer (IGV)

Read densities
[Figure: reads aligned along the genome and the resulting read count profile]

Wiggle files for Genome Browser
variableStep chrom=chr1 span=10
1471 0.3
1481 0.6
1491 0.6
1501 0.6
1511 0.6
1521 0.6
1531 1.1
1541 1.7
1551 1.9
1561 2.1
1571 2.5
1581 2.8
1591 3.2
1601 3.9
1611 3.9
1621 4.5
1631 4.8
1641 4.2
1651 3.9
1661 3.8
1671 3.2
1681 2.4
1691 1.9
1701 1.4
1711 1.3
1721 0.8
1871 1.4
1881 4.9
1891 9.1
1901 9.7
1911 10.7
1921 11.2
1931 12.3
http://genome.ucsc.edu/goldenPath/help/wiggle.html
http://genome.ucsc.edu/goldenPath/help/bigWig.html

7.
Bioconductor packages for high-throughput sequencing

BioC packages
• IRanges http://bioconductor.org/packages/release/bioc/html/IRanges.html
• Rsamtools http://bioconductor.org/packages/2.7/bioc/html/Rsamtools.html
• ShortRead http://bioconductor.org/packages/release/bioc/html/ShortRead.html
• rtracklayer http://bioconductor.org/packages/2.8/bioc/html/rtracklayer.html
• BSgenome
And many more

SAMtools, Unix programs and R/BioC
• Rsamtools
• Unix commands can be run in R
  – system("samtools rmdup -s file.bam file_noclonal.bam")
http://manuals.bioinformatics.ucr.edu/home/ht-seq

8. Challenges and evolution of sequencing and its analysis

Storage is becoming a real problem
Kahn, 2011, Science

Sequencing is becoming faster

Reads are becoming longer
PacBio

How do you interpret sequencing data in a clinical context?

Data integration
• ChIP-seq for BCL6, BCOR, SMRT, H3K79me2, H3K4me1, H3K4me3, H3K27Ac, H3K9Ac, H3K27me3, and DNA methylation (HELP) in LY1 cells
• HiC
• Integrative statistical model
• Predictions / Mechanisms
• Experiments: ChIP-seq / siRNA etc

The end
• ole2001@med.cornell.edu
• eug2002@med.cornell.edu