6 Month Allelic Series RNAseq QC
[slide 1]

QC summary
QC was performed on all 192 samples, focusing on identifying failed or outlier samples. Four samples are recommended for omission from the final analysis dataset based on evidence of RNA degradation, PCA, and model-based gene outlier detection. Those four samples can be found on slide 19.

Additionally, two correctable issues were identified. First, one flowcell's worth of samples was run an additional time to bring read depth up to the required 100 million. This re-run was inadvertently run as 75-mers instead of 50-mers, so those samples contain a mix of read lengths. Second, a subset of cortex samples (Q92 and Q140) appears to contain a very small but detectable amount of liver tissue. The overall dilution is 500-1000x, but given the extraordinary sensitivity of RNAseq this is still measurable. We have recommended a simple filter to remove those liver transcripts, based on the fact that they have a recognizable correlation pattern (listed on slide 29), but other methods may be more sensitive.
[slide 2]

How does CHDI QC RNAseq data in general?
• Mostly we’re looking for outliers
• Also showing overall experiment worked
• When we find outliers, we try to determine the cause
  – That helps show it is an outlier and not part of the biology
• Methods
  – Principal Components Analysis
  – RNA degradation plots
  – Paired end insert size
  – Read lengths
  – Read mapping efficiency
  – Repetitive sequences and their origin
  – Highly expressed genes
  – # gene outliers
[slide 3]

PCA whole dataset
• Not surprisingly, tissues cluster
• Strong sex effect in liver
• Cortex is tightly clustered
[slide 4]

PCA striatum
[Figure: striatum PCA plot; labeled samples: 30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped, 35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped]
• Q lengths cluster, good sign the design worked
• Q92, 111, 140, 175 uniquely cluster
• They even stagger in Q length order
• Couple potential outliers (in red outline)
[slide 5]

PCA cortex
[Figure: cortex PCA plot; labeled samples: 21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped, 20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped]
• Only Q175 stands outside the main cluster
• Possible Q175 outliers, but hard to be certain
[slide 6]

PCA liver
[Figure: liver PCA plot; labeled sample: 450_Liver_Q175_HET_M_L8.LB1_1.clipped]
• Strong sex clustering will need to be accounted for
• No strong Q clusters (sex masking?)
• One potential outlier
[slide 7]

Duplication in brain (representative examples)
[Figure: per-file duplication-level histograms (x: DuplicationLevel, 1 to 10+; y: Percentage) for both reads of representative Q175 striatum and cortex samples, 20914_1_449L through 21018_1_457L]
• Duplication is consistent and hovers between 13-24%
• No red flags
• Higher in striatum than cortex generally
• Origin of the majority of the duplicated sequences is mitochondrial
[slide 8]

Liver duplication (representative examples)
[Figure: per-file duplication-level histograms (x: DuplicationLevel, 1 to 10+; y: Percentage) for both reads of representative Q111 and Q20 liver samples, 520_Liver through 656_Liver]
• Liver duplication is much higher, 40-50%
• Major duplicated sequences are all mouse major urinary protein genes (Mup1-21), which encode pheromone-binding proteins
• Hurts our true read depth, but nothing terrible
• Should keep in mind for future liver work
[slide 9]

5’ -> 3’ degradation charts (representative examples)
[Figure: per-sample degradation charts (x: Bin1 to Bin86; y: 0 to 1), colored by TranscriptBin (1-499, 500-999, 1000-1999, 2000-2999, 3000-3999, 4000-4999, 5000+), for representative samples 20914_449L through 23346_513L]
Displays the likelihood of getting full length transcripts for various mRNA lengths
• Very high quality samples in general
• Most samples show >70% of all mRNA molecules are >80% complete
• Liver on average more degraded
• Some samples have degradation in the longer mRNA species (one marked in red)
[slide 10]

Suspect samples by 5’ -> 3’ degradation
454_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
[slide 11]

GC content per read has a red flag
[Figure: per-file GC content histograms (x: GC#; y: Count) for both reads of representative samples 20914_1_449L through 21018_1_457L]
8 of the samples have a “shoulder” in the GC# chart
This is usually a really bad thing
• Suggests non-mouse or non-biological sequence
[slide 12]

Those same samples flag for read length as well
[Figure: per-file read length histograms (x: SequenceLength; y: Count) for both reads of representative samples 20914_1_449L through 21018_1_457L]
Those same samples have a mix of 50mer reads and 75mer reads
That’s very odd
At this point we asked our sequencing lab for clarification on what happened
[slide 13]

Our sequencing partner found the cause
The 8 suspect samples
264401  20921_1_450L_cortex_Q175_HET_M   20130523  V02604  VIRT  2
264416  23535_1_528L_cortex_Q111_HET_F   20130523  V02604  VIRT  3
264418  28243_1_644L_cortex_Q20_HET_M    20130523  V02604  VIRT  4
264397  35624_1_844L_striatum_Q92_WT_F   20130523  V02604  VIRT  1
264447  35631_1_845L_cortex_Q92_HET_F    20130604  V02761  VIRT  1
264448  35657_1_847L_cortex_Q92_HET_F    20130604  V02761  VIRT  2
264451  454_Liver_Q175_HET_M             20130604  V02761  VIRT  3
264455  462_Liver_Q175_HET_F             20130604  V02761  VIRT  4

For these 8 samples, the initial run didn’t get a full 100 million reads. When that happens the lab runs the samples again and then merges the runs into a “virtual run” with the full read depth we paid for. That’s all good. The strange thing this time was that the run our 8 samples were added to (the lab adds re-runs to ongoing flow cells) happened to be a 75mer run. Again, usually no big problem: the lab clips off 25 bases in its processing and everything is compatible. This specific time they forgot to trim, so we saw the ugly intermediate state.
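A minimal sketch of the clipping step described above, assuming the extra 25 bases come off the 3’ end of each read (the slides do not say which end the lab clips) and that reads are standard 4-line FASTQ records. The function names and toy reads are illustrative only, not the sequencing lab's pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").strip()
        yield header.strip(), seq, qual


def trim_reads(records: Iterable[FastqRecord], target_len: int = 50) -> Iterator[FastqRecord]:
    """Keep the first target_len bases (assumption: clip from the 3' end)."""
    for header, seq, qual in records:
        yield header, seq[:target_len], qual[:target_len]


def length_histogram(records: Iterable[FastqRecord]) -> Dict[int, int]:
    """Count reads per read length, like a SequenceLength chart."""
    return dict(Counter(len(seq) for _, seq, _ in records))


# Toy input: one 50mer read and one 75mer read (made-up sequences).
toy_fastq = [
    "@read1", "A" * 50, "+", "I" * 50,
    "@read2", "C" * 75, "+", "I" * 75,
]
records = list(parse_fastq(toy_fastq))
before = length_histogram(records)
after = length_histogram(trim_reads(records))
print("before trimming:", before)
print("after trimming:", after)
```

Before trimming the histogram has two read lengths, which is the mixed state seen in the SequenceLength charts; after trimming only 50mers remain.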
What this means is that the data for those 8 are fine. They are longer, but still good reads from our samples.
[slide 14]

Mitochondrial rate in brain
8-9% of the reads are mtRNA
Nothing surprising there
[slide 15]

Mitochondrial rate in liver
6-7% of reads are mtRNA
Again in line with expectations
[slide 16]

Other QC parameters that looked great
• Insert sizes: All right around 175 as expected
• Sense/antisense sequence ratio: 1:1 as expected
• Sequence coverage
  – 40% of mouse transcriptome detected in brain
  – About 30% of mouse transcriptome detected in liver
• Mapped read rate in the upper 90s
  – 98% for brain, 96% for liver
• 95-97% of our reads are mapped to known genes
  – 3-5% intergenic regions
[slide 17]

Model based outlier detection
Method by which we look for the number of genes that are outliers after accounting for our modeled effects
• 2 samples stand out, and an additional 4-6 are suspect, but probably OK (Q92 Het males)
[Figure: number of outlier genes per sample (y axis 0 to 8000; x axis: samples); labeled samples: 30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped, 35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped]
[slide 18]
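One way to picture the model based outlier count is sketched below. It is a simplified stand-in, not the model used for these slides: the "modeled effects" are reduced to a per-group median, residuals are scaled per gene by a MAD-based robust standard deviation, and a gene counts against a sample when its scaled residual exceeds a cutoff. The function name, the cutoff of 5, and the toy data are all assumptions.

```python
from collections import defaultdict
from statistics import median
from typing import Dict


def count_outlier_genes(
    expr: Dict[str, Dict[str, float]],
    groups: Dict[str, str],
    z_cutoff: float = 5.0,
) -> Dict[str, int]:
    """Count, per sample, the genes whose residual is extreme.

    expr:   {gene: {sample: log expression}}
    groups: {sample: group label}, e.g. tissue + Q length + genotype
    """
    counts = {sample: 0 for sample in groups}
    for values in expr.values():
        # Stand-in for the fitted model: the median of each group.
        by_group = defaultdict(list)
        for sample, value in values.items():
            by_group[groups[sample]].append(value)
        centers = {g: median(vs) for g, vs in by_group.items()}
        residuals = {s: v - centers[groups[s]] for s, v in values.items()}
        # Robust scale: 1.4826 * MAD approximates a standard deviation.
        scale = 1.4826 * median(abs(r) for r in residuals.values())
        if scale == 0:
            continue  # no spread in this gene, nothing to call
        for sample, r in residuals.items():
            if abs(r) / scale > z_cutoff:
                counts[sample] += 1
    return counts


# Toy data (made up): sample a4 is off in two of three genes.
groups = {"a1": "A", "a2": "A", "a3": "A", "a4": "A",
          "b1": "B", "b2": "B", "b3": "B", "b4": "B"}
expr = {
    "g1": {"a1": 5.0, "a2": 5.1, "a3": 4.9, "a4": 8.0,
           "b1": 6.0, "b2": 6.1, "b3": 5.9, "b4": 6.0},
    "g2": {"a1": 2.0, "a2": 2.2, "a3": 1.8, "a4": 5.0,
           "b1": 3.0, "b2": 3.1, "b3": 2.9, "b4": 3.2},
    "g3": {"a1": 7.0, "a2": 7.1, "a3": 6.9, "a4": 7.0,
           "b1": 7.5, "b2": 7.6, "b3": 7.4, "b4": 7.5},
}
outlier_counts = count_outlier_genes(expr, groups)
print(outlier_counts)
```

Plotting these per-sample counts gives the kind of chart shown on the slide, where a sample that towers over the rest is a candidate outlier.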

Integrating the sample QC to choose omissions
A very simple way to determine what to throw out is to look for multiple strikes against a sample

5’ -> 3’ charts
454_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped

PCA outliers
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
450_Liver_Q175_HET_M_L8.LB1_1.clipped

Model based outliers
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
[slide 19]

Final list of proposed samples for omission
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
[slide 20]

Liver contamination in cortex Q140 and Q92?
While the sequencing lab was looking into the 75mer issue I ran cortex through some basic statistical modeling (omitting the samples mentioned previously)
I found changes, but the pattern and biology were all wrong
[slide 21]

Alb
[Figure: Alb (ENSMUSG00000029368) expression per sample (y: Logged FPKM, 0 to 8), colored by Q length (Q20, Q80, Q92, Q111, Q140, Q175)]
Every single change is an increase
Completely off in Q175, 111, 80, and 20
On (but not that strongly) in 140 and 92
It makes no sense for Q111 to be skipped and for Q175 to go back to normal
Albumin is the top hit? Isn’t Albumin liver specific?
[slide 22]
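Returning to the omission list: the multiple-strikes rule can be sketched as a simple tally over the three flag lists from the integration slide. Reading "multiple strikes" as two or more of the three QC methods is an assumption, as are the function names.

```python
from collections import Counter
from typing import Iterable, List

degradation_5p3p = [
    "454_Liver_Q175_HET_M_L3._1.clipped",
    "456_Liver_Q175_HET_M_L7.LB4_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "845_Liver_Q92_HET_F_L6.LB25_1.clipped",
    "522_Liver_Q111_HET_F_L8.LB13_1.clipped",
    "452_Liver_Q175_HET_M_L1.LB2_1.clipped",
    "776_Liver_Q140_HET_F_L8.LB13_1.clipped",
    "716_Liver_Q80_HET_F_L7.LB6_1.clipped",
    "843_Liver_Q92_HET_F_L6.LB23_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
]
pca_outliers = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
    "450_Liver_Q175_HET_M_L8.LB1_1.clipped",
]
model_outliers = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
]


def tally_strikes(*flag_lists: Iterable[str]) -> Counter:
    """Count how many QC methods flagged each sample."""
    strikes: Counter = Counter()
    for flagged in flag_lists:
        strikes.update(set(flagged))  # one strike per method at most
    return strikes


def propose_omissions(strikes: Counter, min_strikes: int = 2) -> List[str]:
    """Samples flagged by at least min_strikes methods (assumed threshold)."""
    return sorted(s for s, n in strikes.items() if n >= min_strikes)


strikes = tally_strikes(degradation_5p3p, pca_outliers, model_outliers)
omissions = propose_omissions(strikes)
for sample in omissions:
    print(strikes[sample], sample)
```

With these lists the tally returns the same four samples as the final omission list; the liver samples flagged only by the 5’ -> 3’ charts, and 450_Liver flagged only by PCA, stay in.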

Some of the other changed genes are suspicious
• Albumin
• ApoC3, C1
• Mup3, 10, 18, 19
• FABP1
• Urate oxidase
All reasonably solid liver markers
DAVID functional annotation also suggests the altered genes are liver related (p < 10^-5)
[slide 23]

A subset of genes with good correlation between liver and cortex but shifted from the 1:1 axis
[Figure: log-scale scatter plot (x: 33030_768L_cortex_Q140_HET_M_L4.LB9_1.clipped; y: 768_Liver_Q140_HET_M_L3.LB9_1.clipped)]
[slide 24]

Same chart with the “significant” genes in red
[Figure: log-scale scatter plot (x: 33030_768L_cortex_Q140_HET_M_L4.LB9_1.clipped; y: 768_Liver_Q140_HET_M_L3.LB9_1.clipped)]
[slide 25]

Same chart and shading in Q111, notice the lack of linear correlation
[Figure: log-scale scatter plot (x: 23353_514L_cortex_Q111_HET_M_L1.LB9_1.clipped; y: 514_Liver_Q111_HET_M_L8.LB9_1.clipped)]
[slide 26]

What we suspect happened
• The basic problem is that liver specific transcripts should not have correlated expression in cortex
• A very small amount of liver contamination has occurred.
  The shift is 500 to 1000 times lower than normal liver expression
• What this means is only the absolute highest liver expressed genes are detected at all
• The challenge is uniquely identifying the affected genes

FPKMs of albumin, which should not exist in brain
           Cortex   Striatum   Liver
Albumin    173      0.01       40979
[slide 27]

Liver filter created as
• Liver mean count > 2000
• Mean ratio of liver to cortex > 500
• Cortex count > 0
Not a bad first approximation
[Figure: log-scale scatter plot (x: 33056_770L_cortex_Q140_HET_M_L7.LB10_1.clipped; y: 770_Liver_Q140_HET_M_L2.LB10_1.clipped)]
[slide 28]

Effect of filtering out the liver specific genes from the cortex data
[Figure: hits pre-filter and hits post-filter (y axis 0 to 90) for Q80, Q92, Q111, Q140, Q175]
[slide 29]

Summary of QC
• All but 4 of the 192 samples can move forward to the analysis
• A filter to clear out highly expressed liver genes is needed for the cortex Q140 and Q92 sets
• Striatum PCA plots show that CAG length is the single largest global element of variance!
[slide 30]
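The three liver filter criteria (liver mean count > 2000, mean liver to cortex ratio > 500, cortex count > 0) can be sketched as below. This is an approximation under stated assumptions: the "mean ratio" is computed as a ratio of mean counts, the "cortex count" is taken to be the cortex mean count, and the gene labels and counts are made up for illustration.

```python
from typing import Dict, Set, Tuple


def is_liver_contaminant(
    liver_mean: float,
    cortex_mean: float,
    min_liver_mean: float = 2000.0,
    min_ratio: float = 500.0,
) -> bool:
    """Apply the three filter criteria to one gene."""
    if cortex_mean <= 0:
        return False  # criterion: cortex count > 0
    if liver_mean <= min_liver_mean:
        return False  # criterion: liver mean count > 2000
    # Criterion: liver to cortex ratio > 500 (ratio of means, an assumption).
    return liver_mean / cortex_mean > min_ratio


def liver_filter(counts: Dict[str, Tuple[float, float]]) -> Set[str]:
    """Return genes to drop from the cortex analysis.

    counts: {gene: (liver mean count, cortex mean count)}
    """
    return {
        gene
        for gene, (liver_mean, cortex_mean) in counts.items()
        if is_liver_contaminant(liver_mean, cortex_mean)
    }


# Made-up counts for illustration only.
toy_counts = {
    "liver_marker": (1_000_000.0, 1500.0),  # very high in liver, trace in cortex
    "housekeeping": (50_000.0, 60_000.0),   # similar in both tissues
    "neuronal": (5.0, 80_000.0),            # cortex specific
    "low_liver": (1500.0, 1.0),             # ratio passes, liver mean too low
    "liver_only": (300_000.0, 0.0),         # never seen in cortex
}
to_remove = liver_filter(toy_counts)
print(sorted(to_remove))
```

Only genes that are very highly expressed in liver and still show up in cortex at a roughly 500-fold or greater deficit are removed, which matches the idea that only the most abundant liver transcripts survive the dilution.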