Agenda
Friday, August 20th, 2010
9:00 – 10:30 RNA-seq
10:30 – 10:45 Morning Break
10:45 – 12:00 RNA-seq Hands On
12:00 – 1:00 Lunch Break
1:00 – 2:30 ChIP-seq analysis
2:30 – 2:45 Afternoon Break
2:45 – 4:00 ChIP-seq Hands On Demo

Copyright © Partek Inc.

Whole Transcriptome Analysis With RNA-Seq
Ryan Peters, Field Application Specialist

Who is Partek?
• Founded in 1993
• Based in St. Louis, MO USA
• Focused on Genomics
• Thousands of customers worldwide
• Building tools for both biologists and bioinformaticians

What is Partek Genomics Suite?
• Desktop software - no server required
• Supports multiple assays
• Supports multiple assay providers
• Enables Integrated Genomics
• Competitively priced

Partek GS™ for Integrated Genomics
Microarray & Next Generation Sequencing
Genome
• Copy Number
• Total & Allele Specific
• Association
• Loss of Heterozygosity
Transcriptome
• Gene Expression
• Exon/Alternative Splicing
• DGE & mRNA-Seq
Regulation
• ChIP-Chip
• ChIP-Seq
• microRNA

Partek GS™ fits right into your Next Generation Sequencing Pipeline
Data Acquisition -> System software and alignment tools (ELAND, SHRiMP, Corona, BWA, MAQ, *Bowtie) -> Powerful Data Analysis and Visualization (RNA-seq, SmallRNA-seq, ChIP-seq, DNA-seq) -> Publication

Partek Recordings

Whole Transcriptome Analysis
• Detect all known and novel RNAs in a transcriptome
• Differential expression of mRNAs
• Identification of alternative splicing events
• Differential expression of non-coding RNAs
• Coding SNP discovery

Import
• 1 million reads/minute
• Multiple directories: soon
• One format at a time
• *Quality Score option
• .BAM = binary .SAM: now
• MAQ/Corona = color space -> base space

Technical Replicates

Import .pdata
Import Next Gen Data
• Select columns:

Import Next Gen Data: Chromosome Alias File
chr1    1
chr2    2
chr3    3
chr4    4
gi|224589800|ref|NC_000001.10| Homo sapiens chromosome 1, GRCh37 primary reference assembly    1
gi|224589811|ref|NC_000002.11| Homo sapiens chromosome 2, GRCh37 primary reference assembly    2
gi|224589815|ref|NC_000003.11| Homo sapiens chromosome 3, GRCh37 primary reference assembly    3
gi|224589816|ref|NC_000004.11| Homo sapiens chromosome 4, GRCh37 primary reference assembly    4

Import Next Gen Data: .2bit file
hg18.2bit/hg19.2bit

Append already imported data
Merge .pdata tool

RNA-seq workflow in Partek GS

mRNA-seq Data Loaded into Partek
• Sample attributes can easily be added to the table to group biological replicates (if available)

Analyze Known Transcripts
• Junction reads
• Paired or single end reads
• Multiple aligned reads
• Strand-specific reads
• Two main steps
1. Assign the reads to isoforms using a modified E/M algorithm (Xing, Y. et al. Nucl. Acids Res. 2006 34:3150-3160)
2. Use statistics to calculate p-values for differential expression and alternative splicing

Transcript Level Mapping

Create Own Annotation
.gff3 / .pannot

E/M assignment of reads
1. Assume all isoforms are in equal abundance
2. Distribute exon reads between isoforms based on abundance
3. Recalculate isoform abundance based on read counts
4. Stop if isoform abundance is constant; otherwise, return to step 2

1st step E/M algorithm
[Figure: two isoforms; raw reads per exon 16, 6, 8; relative isoform abundance 50% / 50%; exon read distribution 8 and 8 for the first exon, 6 for the second, 4 and 4 for the third]
2nd step E/M algorithm
[Figure: two isoforms; raw reads per exon 16, 6, 8; relative isoform abundance 40% / 60%; exon read distribution 8 and 8 for the first exon, 6 for the second, 4 and 4 for the third]

Completed E/M
[Figure: two isoforms; raw reads per exon 16, 6, 8; relative isoform abundance 25% / 75%; exon read distribution 4 and 12 for the first exon, 6 for the second, 2 and 6 for the third]
Reads per kilobase for each isoform
Orange: 6
Green: 2
Help > Online Tutorials > White Paper > RNA-Seq

Caveats
• This is actually done across sets of overlapping transcripts; different genes sometimes share reads
• e.g., genes on different strands in assays that are not strand specific
• This requires known isoforms
• Novel splicing cannot be quantified
• Genes with few reads are not estimated as accurately
• Simulation data show that increased coverage leads to increased accuracy (Xing, Y. et al. Nucl. Acids Res. 2006 34:3150-3160)

Read Summary

Transcript- & Gene-focused Data Views

A Transcript-focused Data View
• Organized by NCBI mRNA identifiers (e.g., NM_080702)
• Probability of differential transcript expression across groups
• Probability of alternative splicing within a gene
• Both raw & normalized read counts per sample
• Log-likelihood test
[Figure labels: Transcript level, Gene level]

PCA and ANOVA
*Biological Replicates

RNA-Seq Reads Distribution
RPKM: Reads Per Kilobase of exon length per Million mapped reads
RPKM = (16.9316 / 1.766) × (1,000,000 / 11,282,682) = 0.849757
Other transformations?

New Workflow Features
• Exon Level Mapping
• Alternative Splicing

Alternative Splicing
*Biological Replicates
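
The four E/M steps listed earlier can be sketched in a few lines of Python. This is a minimal illustration, not Partek's implementation: the function name `em_isoform_density`, the exon lengths (2 kb, 1 kb, 1 kb) and the isoform structure (orange spans all three exons, green skips the middle one) are assumptions chosen so that the toy example matches the read counts on the slides. Abundance is tracked here as reads per kilobase, which is one way to read "recalculate isoform abundance based on read counts".

```python
def em_isoform_density(exon_reads, exon_kb, isoforms, tol=1e-10, max_iter=10000):
    """Toy E/M: estimate reads-per-kilobase for each isoform.

    exon_reads: reads observed on each exon
    exon_kb:    exon lengths in kilobases (assumed values, not from the slides)
    isoforms:   dict mapping isoform name -> set of exon indices it contains
    """
    # Step 1: assume all isoforms are in equal abundance.
    density = {name: 1.0 for name in isoforms}
    for _ in range(max_iter):
        # Step 2: distribute each exon's reads in proportion to abundance.
        assigned = {name: 0.0 for name in isoforms}
        for exon, reads in enumerate(exon_reads):
            members = [n for n in isoforms if exon in isoforms[n]]
            total = sum(density[n] for n in members)
            if total == 0:
                continue
            for n in members:
                assigned[n] += reads * density[n] / total
        # Step 3: recalculate abundance as assigned reads per kilobase.
        updated = {
            n: assigned[n] / sum(exon_kb[e] for e in isoforms[n])
            for n in isoforms
        }
        # Step 4: stop once abundance no longer changes.
        change = max(abs(updated[n] - density[n]) for n in isoforms)
        density = updated
        if change < tol:
            break
    return density


# Hypothetical layout consistent with the slide's raw reads (16, 6, 8).
density = em_isoform_density(
    exon_reads=[16, 6, 8],
    exon_kb=[2.0, 1.0, 1.0],
    isoforms={"orange": {0, 1, 2}, "green": {0, 2}},
)
print({name: round(value, 3) for name, value in density.items()})
```

Under these assumptions the first pass splits the shared exons 8/8 and 4/4, as on the "1st step" slide, and the iteration settles at about 6 reads per kilobase for orange and 2 for green, a 75% / 25% split.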
Soon to come - All in one
• Read Summary
• Transcript Results
• Transcript Focused View
• Gene Focused View
• Exon Focused View
• Exon Results
Message to describe each spreadsheet and appropriate analysis

Integration Between Data Tables & Visualization
Choice 1: From the workflow
Choice 2: Row Header

Visualization of Differential Expression & Alternative Splicing

Strand-Specific Visualization
Separate Forward and Reverse Reads

Creating Lists of Affected Transcripts

Biological Interpretation: Monitor Biological Trends with GO Enrichment

Biological Interpretation: Up-/Down-regulation of Biological Processes

Differential Expression of Non-coding RNAs
snoRNA, siRNA, miRNA, long non-coding RNA…
Noncode.org; mirbase.org; convert using ‘Manage Available Annotations’

SNP Discovery
• .2bit file
• Computationally intense
• Log Odds ratio > 5.0

Variations Against Reference
• It is 10,000 times more likely that the position differs from the reference than that it is the same as the reference

Variations Across Samples
• It is 10,000 times more likely that at least one sample has a different genotype call than that all the samples have the same genotype call

Overlap with known genes/features

Compare Detected cSNPs Against Known SNPs
SNP Proportion – Detected SNPs

SNP Proportion

Unexplained Peaks
• Find locations on the genome that had reads mapped to them but are not in our chosen annotation (i.e.
RefSeq)
• Find potential novel transcripts, exons
• Set a threshold for minimum number of reads

Unexplained Peaks Result

Discover Novel Exons & Transcripts

Novel Exon in Intronic Region

Allele Specific Expression
Use Analysis of Variance to study allele specific expression based on the interaction of allele (A, T, G, C) counts and sample groups.

Integrated Genomics
A few examples

RNA-seq data and Exon array data

Integration of ChIP-seq & RNA-Seq data
Combine Next-Generation ChIP-Seq and RNA-Seq data into one view.

RNA-Seq Hands On Data
4 Samples
• Brain
• Skeletal Muscle
• Liver
• Heart
• Illumina Genome Analyzer
• Aligned using the ELAND aligner, allowing for up to 2 mismatches
• Tutorial & Data: Help > Online Tutorials > ‘Next Generation Sequencing tab’

ChIP-Seq Analysis in Partek Genomics Suite

What is ChIP-Seq?
• ChIP-Seq = Chromatin Immunoprecipitation Sequencing
• The sequencing of genomic DNA fragments that coprecipitate with a DNA-binding protein under study
• ‘Unbiased’ – does not rely on prior knowledge of precise DNA binding sites (as ChIP-Chip does)
• Results
• The DNA sequence motif that is recognized by the binding protein
• The regulatory sites for any transcription factor
• Direct downstream targets of any transcription factor

Transcription Factors
• DNA binding proteins that attach themselves to the genome with an affinity for a specific DNA sequence
• Function: bind to specific sites in the genome, recruit cofactors, and regulate transcription
• ChIP-seq – identify transcription factor binding sites across entire genomes

Summary of ChIP Seq Assay
1. Collect and fractionate DNA
2-3. Enrich binding sites using IP
3b.
PCR (not shown)
4. Sequence short reads
5. Align and detect peaks
Photos: U.S. Department of Energy Genome Programs

ChIP-Seq Flow Chart
Sequence Reads -> Align Reads to Reference Genome -> Import -> Detect peaks -> Detect motifs
GAGGTTGCAGTTTG     chr1     243919543    R
ACTGCTCCGCCTCA     chr16    49094914     F
GAATAAAAAATCCA     chr13    55882620     F
CGTCCTTCACCCTCT    chr13    110085165    R
CCTTAAGGAAAGGA     chr18    72273046     F
CAGCTAGGGTTGCC     chr2     120786940    R
CTGCTGGTGCTGCG     chr10    73237323     F

ChIP-Seq Workflow in Partek

Sample Data Set
Study mapped the genomic binding sites of the NRSF transcription factor across the entire genome
Two samples: NRSF-enriched ChIP sample (chip.txt) and control sample (mock.txt)
Control: DNA immunoprecipitated by a non-specific control antibody
Johnson et al., Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, Vol. 316 (2007)
Other experimental settings are also supported by Partek:
• Multiple samples
• Technical replicates
• Biological replicates

Goals
• Import ChIP-seq data
• Calculate average fragment length of the IP samples
• Detect and visualize enriched regions in the genome
• Discover motif binding sites
• Annotate enriched regions with overlapping genes
• Look for enriched functional groups using GO Enrichment

Import
• 1 million reads/minute
• Multiple directories: soon
• One format at a time
• *Quality Score
• .BAM: now

Import
hg18.2bit/hg19.2bit

Imported ChIP-Seq Data

QA/QC: Fragment Length Analysis
• Single end reads – phase shift between the forward and reverse reads
• Maximum
• Only on IP samples
• Paired-end reads – distribution of fragment lengths between paired end fragments
PCR Artefacts/Alignment Bias
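
The phase shift between forward and reverse single-end reads can be illustrated with a small cross-correlation sketch. This is a simplified stand-in, not Partek's method: the function name `estimate_fragment_length` and the `max_shift` parameter are hypothetical, the reverse-read coordinate is assumed to mark the far end of the fragment, and exact position coincidences are counted where a production tool would correlate coverage profiles.

```python
def estimate_fragment_length(forward_starts, reverse_ends, max_shift=500):
    """Return the shift that best aligns forward reads onto reverse reads.

    forward_starts: positions of forward-strand reads (fragment start)
    reverse_ends:   positions of reverse-strand reads (assumed fragment end)
    max_shift:      largest fragment length to consider (hypothetical default)
    """
    reverse_set = set(reverse_ends)

    def matches(shift):
        # Number of forward reads that land on a reverse read after shifting.
        return sum(1 for pos in forward_starts if pos + shift in reverse_set)

    # The shift with the most coincidences is the estimated fragment length.
    return max(range(max_shift + 1), key=matches)


# Toy data: every fragment is 150 bp long.
forward = [100, 260, 530, 900]
reverse = [pos + 150 for pos in forward]
print(estimate_fragment_length(forward, reverse))  # 150
```

On real data the maximum is a broad peak, not a single exact shift, which is why the slide lists "Maximum" under the single-end case.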
Cross Correlation Fragment Length Estimation
[Figure: forward (F) and reverse (R) read counts on either side of a probable binding site]

Detect Peaks
Set:
• Average fragment length (read extension length)
• Window size
• Merge (Methyl., Histone)
• FDR cutoff
• Reference sample (needed for SFC & binomial p-value)
Peak detection rate ~ 1 minute / 4 million reads

Peak Detection
1) Extend reads by estimated fragment length (single end)
2) Find midpoints
3) Divide genome into windows of estimated fragment length or 100bp
4) Count number of reads in each window
5) Fit to zero-truncated negative binomial (ZTNB)
[Figure: forward read, reverse read and midpoint along the chromosome for single end and paired end reads; 100bp* windows]

Peak Detection – Read Distribution
[Figure: histogram of # windows against # of midpoints per window (0, 1, 2, 3, 4, 5, 6, 7, …)]

Detect Peaks Results
Peaks are detected in each sample separately and reported one peak at a time
Mann-Whitney – separation of forward and reverse reads
Lower p-value = greater separation

Scaled Fold Change
Scaled Fold Change (ChIP vs. Mock) = (1 + ChIP) / (1 + Alpha × Mock)
[Figure: scatter plot of ChIP sample against Mock (reference sample), both axes 0 to 10; labels: Scaling Factor, Best Fit Line, Slope]
Higher Scaled Fold Change = more enriched

ChIP-seq considerations
• Peaks are detected on a per sample basis
• Control samples are not required, but encouraged
• # reads for each sample don’t have to match
• Antibody selection
• Must have specificity for the protein
• Must be able to immunoprecipitate with the target protein; even antibodies that do may not perform well in ChIP-seq
• Sequencing – platform dependent bias, error rates
• Algorithm – short tags are ambiguous in repeat regions; account for sequence errors

Detected peaks

Chromosome Browser
Help > Online Tutorials > ‘Chromosome Viewer User Guide’
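
Steps 1 to 4 of the peak detection slide and the scaled fold change formula can be sketched as follows. This is an illustration under stated assumptions, not Partek's code: the function names are hypothetical, a reverse read is assumed to be reported by its leftmost aligned base, and `alpha` is simply passed in (how the scaling factor is fitted is not spelled out on the slide). The ZTNB fit of step 5 is left out.

```python
from collections import Counter


def fragment_midpoint(start, strand, read_len, frag_len):
    """Extend a single-end read to frag_len and return the fragment midpoint."""
    if strand == "F":
        # Forward read: the fragment runs rightwards from the read start.
        return start + frag_len // 2
    # Reverse read (assumed reported by leftmost base): the fragment ends
    # at start + read_len and runs leftwards from there.
    return start + read_len - frag_len // 2


def window_counts(reads, read_len, frag_len, window):
    """Count fragment midpoints per fixed-size genomic window."""
    counts = Counter()
    for start, strand in reads:
        counts[fragment_midpoint(start, strand, read_len, frag_len) // window] += 1
    return counts


def scaled_fold_change(chip, mock, alpha):
    """Scaled fold change from the slide: (1 + ChIP) / (1 + alpha * Mock)."""
    return (1 + chip) / (1 + alpha * mock)


# Toy reads as (start, strand) pairs; all numbers are made up.
chip_reads = [(1000, "F"), (1100, "R"), (5000, "F")]
counts = window_counts(chip_reads, read_len=25, frag_len=150, window=100)
print(dict(counts))
print(scaled_fold_change(chip=3, mock=1, alpha=1.0))
```

The two reads flanking position 1050 fall into the same window once they are extended, which is the point of working with midpoints instead of raw read starts.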
Create a List of Enriched Regions
• Regions of DNA which have many reads mapped to them
• They will occur only in our protein bound sample

Detecting motifs: Discover de novo motifs
Height = binding importance; how well a base is preserved

Gibbs Motif Sampler
Search for instances of motif in sequences <-> Create new motif out of discovered instances

1. Randomly choose instances
Region 1: …….ACGTCGTACGTACGACCCTGGAGCCTGA….
Region 2: …….AATGCCCCTGGATTTACGTTGGACGTAGA…..
Region 3: …….TCCGACCCCTGGAGGGGACCCTAGACGA…..
Instances: CGTCGT, GACGTA, GGAGGG

2. Create Count Matrix
Regions 1–3 as in step 1
Instances: CGTCGT, GACGTA, GGAGGG
A    0 1 1 0 0 1
C    1 0 1 1 0 0
G    2 2 0 2 2 1
T    0 0 1 0 1 1

3. Find Motif Instances
Regions 1–3 as in step 1
Instances: GACCCT, GGATTT, GGAGGG
A    0 1 1 0 0 1
C    1 0 1 1 0 0
G    2 2 0 2 2 1
T    0 0 1 0 1 1

4. Update Count Matrix
Regions 1–3 as in step 1
Instances: GACCCT, GGATTT, GGAGGG
A    0 1 2 0 0 0
C    0 0 1 1 1 0
G    3 2 0 1 1 1
T    0 0 0 1 1 2

5. Repeat Until Convergence
Regions 1–3 as in step 1
Instances: GGACCT, GGACCT, GGACCT
A    0 0 3 0 0 0
C    0 0 0 0 3 0
G    3 3 0 3 0 0
T    0 0 0 0 0 3

Motif Instances
LR (probability that the sequence comes from the motif vs. the background distribution)

Detect Known Motifs
IUPAC nucleotide code    Base
A                        Adenine
C                        Cytosine
G                        Guanine
T (or U)                 Thymine (or Uracil)
R                        A or G
Y                        C or T
S                        G or C
W                        A or T
K                        G or T
…                        …
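
Step 2 of the Gibbs sampler walkthrough above, building the count matrix from the current instances, can be sketched in Python. The function name `count_matrix` is hypothetical and only this single step is shown; the sampling and scoring steps are not implemented here.

```python
def count_matrix(instances):
    """Count how often each base occurs at each motif position.

    instances: equal-length DNA strings, one chosen instance per region
    Returns a dict mapping each base to a list of per-position counts.
    """
    width = len(instances[0])
    counts = {base: [0] * width for base in "ACGT"}
    for instance in instances:
        for position, base in enumerate(instance):
            counts[base][position] += 1
    return counts


# The randomly chosen instances from step 1 of the walkthrough.
matrix = count_matrix(["CGTCGT", "GACGTA", "GGAGGG"])
for base in "ACGT":
    print(base, " ".join(str(n) for n in matrix[base]))
```

Each column sums to the number of instances (three here), which is a quick sanity check after every update of the matrix.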
Detecting Motifs: Find Known Motif
REST – another name for NRSF
Probability of occurrence can be thought of as follows:
(1) Shuffle the bases in all the sequences.
(2) Count the number of locations in the shuffled sequence that score above your threshold.

JASPAR Spreadsheet

Overlap with Databases
• Databases of ChIP binding available from UCSC, such as ORegAnno
• Databases of known genes, such as RefFlat
• Genes which overlap with peaks or lie near a motif instance

Find Overlapping Genes
(PAZAR) soon - public database of regulatory sequences and transcription factors

Overlapping Genes

Biological Interpretation
Overlapping genes in a category / All overlapping genes
vs.
All genes in the category / All genes on genome

GO Browser: NRSF ChIP-Seq analysis

Hands on ChIP-seq Data
Study mapped the genomic binding sites of the NRSF transcription factor across the entire genome
Two samples:
1. NRSF-enriched ChIP sample (chip.txt)
2. Control sample without immuno-enrichment (mock.txt)
Johnson et al., Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, Vol. 316 (2007)
Data and tutorial available for download: Help > Online Tutorials

Sneak Preview
Alignment/Import & Analysis
Aligner: Desktop/Laptop

Sneak Preview
Histone Modification
Methylation Workflow – current (Affy-tiling; Illumina-GX)

Sneak Preview
Partek® Genomics Suite™
• Powerful Statistics with Interactive Visualization
• Fast*, Memory-efficient
• Easy to Use
• Enables integration of analysis, even between vendors
• Integrated Genomics can enhance your research pipeline
• Integrated with Public Genomic Resources: NCBI GEO, UCSC, Ensembl, Gene Ontology, KEGG…
Get your FREE trial today!
Email www.partek.com